Featurisation & Model Tuning Project
Table of Contents
Tasks Planned
- Import libraries
- Data Reading and understanding
- Data cleansing
- Delete features that have more than 20% null values
- Delete features that have the same value in all rows
- Check continuous features that have very few unique values
- Delete continuous features that are mostly (> 85%) zeros
- Transform the Time variable into Year, Month, Day and Day-of-week columns; delete the Time and Year columns
- Check for multicollinearity and drop correlated features, keeping the first variable of each correlated pair
- Check for features with a very low coefficient of variation and drop such columns
- Check for outliers and treat them with a capping mechanism
- EDA: univariate, bivariate and multivariate analysis
- Histogram for all features
- Boxplot for all features
- Pie chart showing the distribution of the target variable
- Scatterplots against the target variable for the 20 features most highly correlated with it
- Barplots against the target variable for the same 20 most highly correlated features
- Violin plots against the target variable for the same 20 most highly correlated features
- Heatmap of the 30 most highly correlated variables
- Data preprocessing
- Split data into X, y
- Balancing using SMOTE
- Split data into train and test.
- Standardize data
- Compare statistics (except count) of train and test with the original data
- Model building
- Define Goal statement
- Define user defined functions to store and display results/metrics of models
- Train a model on the original data using logistic regression or random forest
- Check cross-validation scores for the trained models using KFold and StratifiedKFold
- Hyperparameter tuning on one of the models using GridSearchCV
- PCA dimensionality reduction on the original balanced, scaled data; split the PCA data into train and test for further model building
- Train a random forest model on the PCA data, then tune it with hyperparameters on the same data; find cross-validation scores
- Print outputs and classification reports
- Repeat the same steps for other models using a Pipeline:
  - Define a Pipeline and assign various models to it
  - Train all pipeline models on the original balanced, scaled data and perform cross-validation on them
  - Train all pipeline models on the PCA-transformed data and tune them with hyperparameters and GridSearchCV
- Post Training and Conclusion
- Display performance of all models
- Find best model
- Choose a model for future use
- Conclude
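The pipeline-based model comparison planned above can be illustrated with a minimal sketch. This is an assumed outline, not the project's actual code; synthetic data stands in for the balanced, scaled signal data:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data; in the project this would be the balanced, scaled signal data
X, y = make_classification(n_samples=300, n_features=20, random_state=42)

# Each pipeline bundles scaling with a classifier, so cross-validation
# re-fits the scaler inside every fold (no leakage from test folds)
pipelines = {
    'logreg': Pipeline([('scale', StandardScaler()),
                        ('clf', LogisticRegression(max_iter=1000))]),
    'rf': Pipeline([('scale', StandardScaler()),
                    ('clf', RandomForestClassifier(n_estimators=100,
                                                   random_state=42))]),
}

for name, pipe in pipelines.items():
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.4f}")
```

Hyperparameter tuning would then wrap each pipeline in `GridSearchCV`, addressing parameters with the `step__param` naming convention (e.g. `clf__n_estimators`).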
Common reusable functions for model building and performance measurement:
AddModelResults: stores each model's results in a results DataFrame, which is later used to identify the best models.
UpdateKFoldSKFScores: updates the cross-validation scores for a particular model in the results DataFrame.
Modelfit_print: builds a model and reports its performance. It:
- Prints performance metrics
- Calls the function that stores the results in the results DataFrame
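A minimal sketch of what two of these helpers might look like. The function names match the descriptions above, but the metric choices and signatures are assumptions, not the project's actual implementation:

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Shared results table that accumulates one row per evaluated model
results = pd.DataFrame(columns=['Model', 'Accuracy', 'Precision',
                                'Recall', 'F1'])

def AddModelResults(name, y_true, y_pred):
    """Append one model's test metrics as a new row in `results`."""
    results.loc[len(results)] = [
        name,
        accuracy_score(y_true, y_pred),
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
        f1_score(y_true, y_pred),
    ]

def Modelfit_print(name, model, X_train, y_train, X_test, y_test):
    """Fit a model, print its accuracy, and store its metrics."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name}: accuracy = {accuracy_score(y_test, y_pred):.4f}")
    AddModelResults(name, y_test, y_pred)
```

Sorting `results` by a chosen metric at the end then surfaces the best-performing model.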
#Import libraries
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
import pickle
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', None)
Q.1. Import and understand the data
Q.1.A. Import 'signal-data.csv' as a DataFrame.
df_signal = pd.read_csv('signal-data.csv')
print(df_signal.shape)
df_signal.head()
(1567, 592)
|   | Time | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 581 | 582 | 583 | 584 | 585 | 586 | 587 | 588 | 589 | Pass/Fail |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2008-07-19 11:55:00 | 3030.93 | 2564.00 | 2187.7333 | 1411.1265 | 1.3602 | 100.0 | 97.6133 | 0.1242 | 1.5005 | ... | NaN | 0.5005 | 0.0118 | 0.0035 | 2.3630 | NaN | NaN | NaN | NaN | -1 |
| 1 | 2008-07-19 12:32:00 | 3095.78 | 2465.14 | 2230.4222 | 1463.6606 | 0.8294 | 100.0 | 102.3433 | 0.1247 | 1.4966 | ... | 208.2045 | 0.5019 | 0.0223 | 0.0055 | 4.4447 | 0.0096 | 0.0201 | 0.0060 | 208.2045 | -1 |
| 2 | 2008-07-19 13:17:00 | 2932.61 | 2559.94 | 2186.4111 | 1698.0172 | 1.5102 | 100.0 | 95.4878 | 0.1241 | 1.4436 | ... | 82.8602 | 0.4958 | 0.0157 | 0.0039 | 3.1745 | 0.0584 | 0.0484 | 0.0148 | 82.8602 | 1 |
| 3 | 2008-07-19 14:43:00 | 2988.72 | 2479.90 | 2199.0333 | 909.7926 | 1.3204 | 100.0 | 104.2367 | 0.1217 | 1.4882 | ... | 73.8432 | 0.4990 | 0.0103 | 0.0025 | 2.0544 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | -1 |
| 4 | 2008-07-19 15:22:00 | 3032.24 | 2502.87 | 2233.3667 | 1326.5200 | 1.5334 | 100.0 | 100.3967 | 0.1235 | 1.5031 | ... | NaN | 0.4800 | 0.4766 | 0.1045 | 99.3032 | 0.0202 | 0.0149 | 0.0044 | 73.8432 | -1 |
5 rows × 592 columns
df_signal.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1567 entries, 0 to 1566
Columns: 592 entries, Time to Pass/Fail
dtypes: float64(590), int64(1), object(1)
memory usage: 7.1+ MB
Q.1.B. Print the 5-point summary and share at least 2 observations.
# Use the describe() function
summary_stats = df_signal.describe()
# Transpose the summary statistics DataFrame for better readability
summary_stats = summary_stats.transpose()
# Print the five-number summary for each numerical feature
print(summary_stats[['min', '25%', '50%', '75%', 'max']])
                 min          25%         50%          75%        max
0          2743.2400  2966.260000  3011.49000  3056.650000  3356.3500
1          2158.7500  2452.247500  2499.40500  2538.822500  2846.4400
2          2060.6600  2181.044400  2201.06670  2218.055500  2315.2667
3             0.0000  1081.875800  1285.21440  1591.223500  3715.0417
4             0.6815     1.017700     1.31680     1.525700  1114.5366
...              ...          ...         ...          ...        ...
589           0.0000    44.368600    71.90050   114.749700   737.3048
Pass/Fail    -1.0000    -1.000000    -1.00000    -1.000000     1.0000

[591 rows x 5 columns] (full per-feature printout truncated for readability)
Insights
- The Pass/Fail column predominantly contains "-1", indicating that most production entities pass the in-house line testing; the maximum value of 1 shows that some entities do fail.
- Many columns contain '0' in every row.
- Comparing the mean and max values suggests many outliers in the data.
- We have 592 columns and 1567 rows.
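The class imbalance noted in the first insight can be quantified directly with `value_counts`. A minimal sketch using a synthetic series built from the counts reported for this dataset (1463 pass, 104 fail); `pass_fail` is a stand-in for `df_signal['Pass/Fail']`:

```python
import pandas as pd

# Synthetic stand-in for df_signal['Pass/Fail'], using the reported counts.
pass_fail = pd.Series([-1] * 1463 + [1] * 104, name="Pass/Fail")

counts = pass_fail.value_counts()              # absolute counts per class
ratios = pass_fail.value_counts(normalize=True)  # class shares

print(counts.to_dict())   # {-1: 1463, 1: 104}
print(f"Fail rate: {ratios[1]:.2%}")
```

A fail rate this low (under 7%) is what motivates the SMOTE balancing step planned in the preprocessing section.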
Q.2. Data cleansing:¶
#Lets save column count before data processing
num_columns_before = df_signal.shape[1]
Q.2.A. Write a for loop which will remove all the features with 20%+ Null values and impute rest with mean of the feature.¶
# Calculate the threshold for 20% null values
threshold = 0.2 * len(df_signal)

# List to store features to be dropped
features_to_drop = []

# Iterate over each feature
for feature in df_signal.columns:
    # Count null values for the feature
    null_count = df_signal[feature].isnull().sum()
    if null_count >= threshold:
        # 20%+ nulls: mark the feature for removal
        features_to_drop.append(feature)
    elif feature != 'Time':
        # Impute null values with the mean for numeric features
        # with less than 20% null values
        mean_value = df_signal[feature].mean()
        df_signal[feature].fillna(mean_value, inplace=True)

# Drop features with 20%+ null values
df_signal.drop(columns=features_to_drop, inplace=True)
print('Data shape after above activity:', df_signal.shape)
Data shape after above activity: (1567, 560)
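The loop above can also be expressed without explicit iteration: `df.isnull().mean()` gives each column's null share in one pass, and a single `fillna` imputes all kept columns. A sketch on a hypothetical toy frame (the column names `a`, `b`, `c` are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_signal:
# 'a' has 10% nulls (kept, imputed), 'b' has 30% (dropped), 'c' has none.
df = pd.DataFrame({
    "a": [1.0, np.nan, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "b": [np.nan, np.nan, np.nan, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "c": [float(i) for i in range(10)],
})

# Keep columns whose null share is below 20%, then mean-impute the rest.
keep = df.columns[df.isnull().mean() < 0.2]
cleaned = df[keep].fillna(df[keep].mean())

print(list(cleaned.columns))         # ['a', 'c']
print(cleaned.isnull().sum().sum())  # 0
```

On the real data the 'Time' column would need to be excluded from the mean imputation, as in the loop version.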
Q.2.B. Identify and drop the features which are having same value for all the rows¶
# Identify features with constant values
constant_features = [col for col in df_signal.columns if df_signal[col].nunique() == 1]
# Print constant features in a single line
print("Features with the same value in all rows (to be dropped):", ", ".join(constant_features))
# Drop constant features
df_signal.drop(columns=constant_features, inplace=True)
print('Data shape after above activity:', df_signal.shape)
Features with the same value in all rows (to be dropped): 5, 13, 42, 49, 52, 69, 97, 141, 149, 178, 179, 186, 189, 190, 191, 192, 193, 194, 226, 229, 230, 231, 232, 233, 234, 235, 236, 237, 240, 241, 242, 243, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 276, 284, 313, 314, 315, 322, 325, 326, 327, 328, 329, 330, 364, 369, 370, 371, 372, 373, 374, 375, 378, 379, 380, 381, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 414, 422, 449, 450, 451, 458, 461, 462, 463, 464, 465, 466, 481, 498, 501, 502, 503, 504, 505, 506, 507, 508, 509, 512, 513, 514, 515, 528, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538 Data shape after above activity: (1567, 444)
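`nunique()` is computed per column, so the list comprehension above can be replaced by a single boolean mask over the columns. A minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical frame: 'const' is identical in every row, 'x' varies.
df = pd.DataFrame({"const": [7, 7, 7], "x": [1, 2, 3]})

# One vectorized mask instead of a per-column loop.
constant_cols = df.columns[df.nunique() == 1]
df = df.drop(columns=constant_cols)

print(list(df.columns))  # ['x']
```

One caveat: `nunique()` ignores NaN by default, so an all-NaN column reports 0 unique values and would need its own check (here such columns were already removed by the 20%-null filter).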
Q.2.C. Drop other features if required using relevant functional knowledge. Clearly justify the same.¶
Steps to follow
- Check features with very few unique values.
- Check and drop features with too many zeros.
- Transform Time into Year, Month, Day and DayOfWeek columns to check for patterns and correlation with the target variable. Delete Time and any column that has only one value.
Let's observe the features with few unique values. A continuous variable with very few unique values is unlikely to be useful.
features_with_few_unique_values = []
for column in df_signal.columns:
    unique_values_count = df_signal[column].nunique()
    if unique_values_count < 20:
        features_with_few_unique_values.append(column)

# Print unique values and their occurrences for features with few unique values
for feature in features_with_few_unique_values:
    unique_values_counts = df_signal[feature].value_counts()
    unique_values_info = [f"{value} ({count})" for value, count in unique_values_counts.items()]
    print(f"Feature '{feature}': {' | '.join(unique_values_info)}")
Feature '74': 0.0 (1560) | 0.002687700192184497 (6) | 4.1955 (1) Feature '95': 0.0 (677) | 0.0001 (516) | 0.0002 (219) | -0.0001 (95) | 0.0003 (28) | -0.0002 (8) | 6.0025624599615635e-05 (6) | 0.0004 (5) | -0.0004 (4) | -0.0003 (3) | 0.0007 (2) | -0.0009 (1) | 0.0009 (1) | -0.0005 (1) | 0.0006 (1) Feature '206': 0.0 (1560) | 0.0012812299807815502 (6) | 2.0 (1) Feature '209': 0.0 (1560) | 0.02956438180653427 (6) | 46.15 (1) Feature '342': 0.0 (1560) | 0.00028648302370275463 (6) | 0.4472 (1) Feature '347': 0.0 (1560) | 0.008913965406790519 (6) | 13.9147 (1) Feature '478': 0.0 (1560) | 0.12812299807815503 (6) | 200.0 (1) Feature '521': 0.0 (1546) | 1000.0 (14) | 907.91 (1) | 776.2169 (1) | 158.2158 (1) | 604.2009 (1) | 718.6039999999999 (1) | 553.2097 (1) | 474.6376 (1) Feature 'Pass/Fail': -1 (1463) | 1 (104)
We can see there are continuous-variable features that contain 0 in the vast majority of rows.
features_to_drop = []  # List to store features to be dropped
for feature in features_with_few_unique_values:
    zero_count = (df_signal[feature] == 0).sum()   # Count of 0 values in the feature
    zero_percentage = zero_count / len(df_signal)  # Share of 0 values
    if zero_percentage > 0.8:
        features_to_drop.append(feature)           # Mark the feature for removal

# Drop features with 0s in more than 80% of rows
df_signal.drop(columns=features_to_drop, inplace=True)
print("Features with few unique values (those with 0s in more than 80% of rows were dropped):")
print(features_with_few_unique_values)
print(df_signal.shape)
Features with few unique values (those with 0s in more than 80% of rows were dropped): ['74', '95', '206', '209', '342', '347', '478', '521', 'Pass/Fail'] (1567, 437)
Now let's find the remaining features with 0s in more than 80% of rows.
# Find features with 0 value in more than 80% of the rows
features_with_high_zero_percentage = []
for column in df_signal.columns:
    zero_count = (df_signal[column] == 0).sum()
    zero_percentage = zero_count / len(df_signal)
    if zero_percentage > 0.8:
        features_with_high_zero_percentage.append(column)

# Drop features with 0s in more than 80% of rows
df_signal.drop(columns=features_with_high_zero_percentage, inplace=True)
print("Features with 0 value in more than 80% of the rows:")
print(features_with_high_zero_percentage)
print("DataFrame shape after dropping features with 0 value in more than 80% of the rows:")
print(df_signal.shape)
Features with 0 value in more than 80% of the rows: ['114', '249', '387'] DataFrame shape after dropping features with 0 value in more than 80% of the rows: (1567, 434)
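The zero-share scan above can also be written without a loop: `(df == 0).mean()` gives the per-column share of zeros in one pass. A sketch on a hypothetical frame (column names `z` and `y` are illustrative):

```python
import pandas as pd

# Hypothetical frame: 'z' is zero in 90% of rows, 'y' only in 10%.
df = pd.DataFrame({"z": [0] * 9 + [5], "y": list(range(10))})

# Per-column share of zeros, then drop columns above the 80% threshold.
zero_share = (df == 0).mean()
df = df.drop(columns=zero_share[zero_share > 0.8].index)

print(list(df.columns))  # ['y']
```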
Justification (deleting features dominated by 0s): Upon analysis, it is evident that some continuous variables contain predominantly zero values across the dataset. These features lack variability and contribute little to model building. Given their negligible contribution, and the fact that they are continuous rather than categorical, it is prudent to remove them to streamline the modeling process.
Check usability of the 'Time' feature
Upon initial examination, it appears that the 'Time' feature, which comprises datetime values, may not provide significant utility for our analysis. To be thorough, however, we can explore potential patterns by disaggregating the data into the year, month and day components of 'Time'. This will help us determine whether there are discernible trends or correlations between these temporal aspects and the pass/fail outcomes.
df_signal['Time'] = pd.to_datetime(df_signal['Time'])
df_signal['Year'] = df_signal['Time'].dt.year
df_signal['Month'] = df_signal['Time'].dt.month
df_signal['Day'] = df_signal['Time'].dt.day
df_signal['Weekday'] = df_signal['Time'].dt.weekday # Monday=0, Sunday=6
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
# Plot each feature against 'Pass/Fail'
sns.countplot(x='Year', hue='Pass/Fail', data=df_signal, ax=axes[0, 0])
sns.countplot(x='Month', hue='Pass/Fail', data=df_signal, ax=axes[0, 1])
sns.countplot(x='Day', hue='Pass/Fail', data=df_signal, ax=axes[1, 0])
sns.countplot(x='Weekday', hue='Pass/Fail', data=df_signal, ax=axes[1, 1])
# Add titles and adjust layout
axes[0, 0].set_title('Year vs Pass/Fail')
axes[0, 1].set_title('Month vs Pass/Fail')
axes[1, 0].set_title('Day vs Pass/Fail')
axes[1, 1].set_title('Weekday vs Pass/Fail')
plt.tight_layout()
# Show the plots
plt.show()
Justification (Delete Year and Time) After reviewing the visualizations presented above, we have decided to eliminate the 'Year' column from our dataset due to its singular unique value of '2008'. However, upon further analysis, it is evident that the newly created features 'Month' and 'Day' exhibit distinctive distributions. Consequently, we have opted to retain these columns as they are likely to contribute valuable information to our model.
Also, now that we have transformed the Time data into meaningful variables such as Month and Day, we can delete Time, as it adds no value in its original format.
df_signal.drop(columns=['Year'], inplace=True)
# Drop the original 'Time' column
df_signal.drop(columns=['Time'], inplace=True)
Q.2.D. Check for multicollinearity in the data and take necessary action.¶
Removing high multicollinearity
- Run a loop over all features and find each feature's correlated features (there can be several).
- Use a threshold (80%) and flag correlations above it.
- Keep the original feature the loop is currently on and delete all features correlated with it.
- Print the correlated features along with their correlation values, for transparency and to avoid mistakes.
correlation_matrix = df_signal.corr()

# Threshold for flagging highly correlated features
threshold_correlation = 0.80
features_to_drop = []
deleted_columns_with_correlation = {}
processed_features = set()

for column in correlation_matrix.columns:
    if column not in processed_features:
        correlated_features = correlation_matrix[column][np.abs(correlation_matrix[column]) >= threshold_correlation]
        if not correlated_features.empty:
            # Drop the feature itself from its list of correlated features
            correlated_features = correlated_features.drop(column, errors='ignore')
            # Store the correlated features along with their correlation values
            deleted_columns_with_correlation[column] = correlated_features
            features_to_drop.extend(correlated_features.index)
            processed_features.update(correlated_features.index)

# Drop correlated features
df_signal.drop(columns=features_to_drop, inplace=True)

# Print deleted columns along with their correlation values
print("Deleted columns with their correlated features and correlation values:")
for column, correlated_features in deleted_columns_with_correlation.items():
    print(f"Column: {column}")
    for feature, correlation_value in correlated_features.items():
        print(f" - Correlated feature: {feature}, Correlation value: {correlation_value}")

print("Data shape after dropping correlated columns:")
print(df_signal.shape)

# Print unique features_to_drop in a single line
unique_features_to_drop = list(set(features_to_drop))
print("Unique features to drop:", ", ".join(unique_features_to_drop))
print("Number of unique features to drop:", len(unique_features_to_drop))
Deleted columns with their correlated features and correlation values: Column: 0 Column: 1 Column: 2 Column: 3 Column: 4 - Correlated feature: 140, Correlation value: 0.9999751247610734 - Correlated feature: 275, Correlation value: 0.9999755698304461 - Correlated feature: 413, Correlation value: 0.9384157839693643 Column: 6 Column: 7 Column: 8 Column: 9 Column: 10 Column: 11 Column: 12 Column: 14 Column: 15 Column: 16 - Correlated feature: 147, Correlation value: 0.8856942312057353 - Correlated feature: 148, Correlation value: 0.9702941014493074 - Correlated feature: 152, Correlation value: 0.9775661160561907 - Correlated feature: 154, Correlation value: 0.8736683117857272 - Correlated feature: 282, Correlation value: 0.8847734601955165 - Correlated feature: 283, Correlation value: 0.9713232599947812 - Correlated feature: 287, Correlation value: 0.9776476351736362 - Correlated feature: 289, Correlation value: 0.8771312300857304 - Correlated feature: 420, Correlation value: 0.8963184443097968 - Correlated feature: 421, Correlation value: 0.9630496567159946 - Correlated feature: 425, Correlation value: 0.9367395458941122 - Correlated feature: 427, Correlation value: 0.8934125804427402 Column: 17 Column: 18 Column: 19 - Correlated feature: 155, Correlation value: -0.805518312476151 - Correlated feature: 290, Correlation value: -0.814493635042058 - Correlated feature: 428, Correlation value: -0.8471764161417576 Column: 20 Column: 21 Column: 22 Column: 23 Column: 24 Column: 25 - Correlated feature: 26, Correlation value: 0.8231111215973265 - Correlated feature: 27, Correlation value: 0.9803753833240896 Column: 28 Column: 29 - Correlated feature: 30, Correlation value: 0.8581473047890565 Column: 31 Column: 32 Column: 33 Column: 34 - Correlated feature: 36, Correlation value: -0.9999999997052148 Column: 35 Column: 37 Column: 38 Column: 39 Column: 40 Column: 41 Column: 43 - Correlated feature: 60, Correlation value: 0.8985252230295192 Column: 44 Column: 45 - Correlated 
feature: 46, Correlation value: 0.8090426063697485 Column: 47 Column: 48 Column: 50 - Correlated feature: 46, Correlation value: 0.904481825916677 Column: 51 Column: 53 - Correlated feature: 54, Correlation value: 0.9352211984898346 Column: 55 Column: 56 Column: 57 Column: 58 Column: 59 Column: 61 Column: 62 Column: 63 Column: 64 - Correlated feature: 65, Correlation value: 0.8433685545726131 Column: 66 - Correlated feature: 46, Correlation value: 0.8237626375988376 - Correlated feature: 70, Correlation value: 0.9044609903388542 Column: 67 - Correlated feature: 196, Correlation value: 0.8587815560501445 - Correlated feature: 197, Correlation value: 0.8636750689498435 - Correlated feature: 199, Correlation value: 0.8109284322378567 - Correlated feature: 204, Correlation value: 0.9022307064188828 - Correlated feature: 205, Correlation value: 0.8716725202346604 - Correlated feature: 207, Correlation value: 0.8599106909925411 - Correlated feature: 332, Correlation value: 0.8783145399781462 - Correlated feature: 333, Correlation value: 0.8737528985362206 - Correlated feature: 335, Correlation value: 0.8491971118754746 - Correlated feature: 336, Correlation value: 0.8715138675248129 - Correlated feature: 340, Correlation value: 0.9466368752862868 - Correlated feature: 341, Correlation value: 0.9053174973585898 - Correlated feature: 343, Correlation value: 0.8742735871968653 - Correlated feature: 469, Correlation value: 0.868775488079695 - Correlated feature: 477, Correlation value: 0.9218150346605672 - Correlated feature: 479, Correlation value: 0.8514148502196563 Column: 68 Column: 71 Column: 75 Column: 76 Column: 77 Column: 78 Column: 79 Column: 80 Column: 81 Column: 82 Column: 83 Column: 84 Column: 86 Column: 87 Column: 88 Column: 89 Column: 90 Column: 91 Column: 92 - Correlated feature: 105, Correlation value: -0.9888957525678773 Column: 93 - Correlated feature: 106, Correlation value: -0.9912928531099834 Column: 94 - Correlated feature: 96, Correlation value: 
-0.9570098301742919 - Correlated feature: 98, Correlation value: 0.8385851576539535 Column: 95 Column: 99 - Correlated feature: 104, Correlation value: -0.989545429124828 Column: 100 Column: 101 - Correlated feature: 98, Correlation value: 0.9067879315108982 Column: 102 Column: 103 Column: 107 Column: 108 Column: 113 Column: 115 Column: 116 Column: 117 - Correlated feature: 252, Correlation value: 0.9861933657838748 - Correlated feature: 390, Correlation value: 0.9862339054518433 - Correlated feature: 524, Correlation value: 0.9786433214466864 Column: 118 Column: 119 - Correlated feature: 526, Correlation value: -0.8139549042186852 Column: 120 Column: 121 - Correlated feature: 123, Correlation value: 0.942282649762535 - Correlated feature: 124, Correlation value: 0.8930804590802048 Column: 122 - Correlated feature: 127, Correlation value: 0.962085701180184 - Correlated feature: 130, Correlation value: -0.8321509702312138 Column: 125 Column: 126 Column: 128 Column: 129 Column: 131 Column: 132 Column: 133 Column: 134 Column: 135 - Correlated feature: 270, Correlation value: 0.946473934995659 - Correlated feature: 408, Correlation value: 0.9991662845035244 Column: 136 - Correlated feature: 271, Correlation value: 0.9712649283173538 - Correlated feature: 409, Correlation value: 0.9975306961053592 Column: 137 - Correlated feature: 272, Correlation value: 0.9768378109004282 - Correlated feature: 410, Correlation value: 0.9958138223714479 Column: 138 - Correlated feature: 273, Correlation value: 0.9191717004943404 - Correlated feature: 411, Correlation value: 0.9979823996066916 Column: 139 - Correlated feature: 274, Correlation value: 0.9857281443608049 - Correlated feature: 412, Correlation value: 0.8490841595671681 Column: 142 - Correlated feature: 277, Correlation value: 0.97489173993164 - Correlated feature: 415, Correlation value: 0.9926143870057583 Column: 143 - Correlated feature: 278, Correlation value: 0.9111875876971073 - Correlated feature: 416, Correlation 
value: 0.9983535268348297 Column: 144 - Correlated feature: 279, Correlation value: 0.9767552096844969 - Correlated feature: 417, Correlation value: 0.9932605213388946 Column: 145 - Correlated feature: 280, Correlation value: 0.9597816072041465 Column: 146 - Correlated feature: 281, Correlation value: 0.9538695194170509 Column: 150 - Correlated feature: 285, Correlation value: 0.9702572948550757 Column: 151 - Correlated feature: 286, Correlation value: 0.9900031988390738 - Correlated feature: 424, Correlation value: 0.9762310553985714 Column: 153 - Correlated feature: 288, Correlation value: 0.9982217992904813 - Correlated feature: 426, Correlation value: 0.9957198980645307 Column: 156 - Correlated feature: 291, Correlation value: 0.9930169405207437 - Correlated feature: 429, Correlation value: 0.9982487349306526 Column: 159 - Correlated feature: 164, Correlation value: 0.8006080651811647 - Correlated feature: 294, Correlation value: 0.9932395425046524 - Correlated feature: 430, Correlation value: 0.8664413438952592 Column: 160 - Correlated feature: 295, Correlation value: 0.9964804116578503 - Correlated feature: 431, Correlation value: 0.811687539625834 Column: 161 - Correlated feature: 296, Correlation value: 0.9946485928894578 Column: 162 - Correlated feature: 297, Correlation value: 0.9884014877258696 Column: 163 - Correlated feature: 164, Correlation value: 0.9248163707333443 - Correlated feature: 165, Correlation value: 0.8978959469955342 - Correlated feature: 298, Correlation value: 0.9933923566636502 - Correlated feature: 299, Correlation value: 0.9227460416812279 - Correlated feature: 300, Correlation value: 0.9007296768852807 - Correlated feature: 430, Correlation value: 0.8267685015840507 - Correlated feature: 431, Correlation value: 0.8113194102863331 - Correlated feature: 434, Correlation value: 0.8763638597692313 - Correlated feature: 435, Correlation value: 0.8446316896884513 - Correlated feature: 436, Correlation value: 0.8397039520251955 Column: 
166 - Correlated feature: 301, Correlation value: 0.9642422923063492 - Correlated feature: 437, Correlation value: 0.9901707034186475 Column: 167 - Correlated feature: 302, Correlation value: 0.97848862988039 Column: 168 - Correlated feature: 303, Correlation value: 0.9641650773883766 Column: 169 - Correlated feature: 304, Correlation value: 0.9756858945190093 - Correlated feature: 440, Correlation value: 0.9957217269246035 Column: 170 - Correlated feature: 305, Correlation value: 0.961749880056381 - Correlated feature: 441, Correlation value: 0.9949854256795806 Column: 171 - Correlated feature: 306, Correlation value: 0.9872003574538434 - Correlated feature: 442, Correlation value: 0.9745585292139416 Column: 172 - Correlated feature: 174, Correlation value: 0.9999998114016202 - Correlated feature: 307, Correlation value: 0.9599578693189514 - Correlated feature: 309, Correlation value: 0.959980831223189 - Correlated feature: 443, Correlation value: 0.9985340536228607 - Correlated feature: 445, Correlation value: 0.997069575051326 Column: 173 - Correlated feature: 308, Correlation value: 0.9587135486766537 - Correlated feature: 444, Correlation value: 0.993856721934056 Column: 175 - Correlated feature: 310, Correlation value: 0.9550617995202938 - Correlated feature: 446, Correlation value: 0.9994829487948597 Column: 176 - Correlated feature: 311, Correlation value: 0.9794835026346246 - Correlated feature: 447, Correlation value: 0.9998871452073378 Column: 177 - Correlated feature: 312, Correlation value: 0.9971297572419839 - Correlated feature: 448, Correlation value: 0.9995369413388282 Column: 180 - Correlated feature: 316, Correlation value: 0.8810806443748563 - Correlated feature: 452, Correlation value: 0.9944279958815152 Column: 181 - Correlated feature: 317, Correlation value: 0.956757128789213 - Correlated feature: 453, Correlation value: 0.9991299429621868 Column: 182 - Correlated feature: 318, Correlation value: 0.9808274700701617 - Correlated feature: 454, 
Correlation value: 0.9890820509449046 Column: 183 - Correlated feature: 319, Correlation value: 0.9821899876660052 - Correlated feature: 455, Correlation value: 0.9979569278279253 Column: 184 - Correlated feature: 320, Correlation value: 0.9913047453517891 - Correlated feature: 456, Correlation value: 0.9701564393809522 Column: 185 - Correlated feature: 187, Correlation value: 0.8266683422491669 - Correlated feature: 321, Correlation value: 0.9942440963683824 - Correlated feature: 323, Correlation value: 0.8161383119821825 - Correlated feature: 457, Correlation value: 0.9967773319710982 - Correlated feature: 459, Correlation value: 0.8219117433634623 Column: 188 - Correlated feature: 324, Correlation value: 0.9753022699264602 Column: 195 - Correlated feature: 331, Correlation value: 0.9453081768272615 - Correlated feature: 467, Correlation value: 0.9992729354838393 Column: 198 - Correlated feature: 334, Correlation value: 0.9865835338216226 - Correlated feature: 470, Correlation value: 0.9970906141746344 Column: 200 Column: 201 - Correlated feature: 202, Correlation value: 0.8021281017312017 - Correlated feature: 337, Correlation value: 0.9322566640115545 - Correlated feature: 473, Correlation value: 0.8699875437146526 Column: 203 - Correlated feature: 196, Correlation value: 0.8135748787114265 - Correlated feature: 199, Correlation value: 0.8004012401977514 - Correlated feature: 202, Correlation value: 0.8436442160004376 - Correlated feature: 207, Correlation value: 0.8606436643272622 - Correlated feature: 338, Correlation value: 0.8621201954606564 - Correlated feature: 339, Correlation value: 0.9827091137191302 - Correlated feature: 471, Correlation value: 0.8038480339926433 - Correlated feature: 475, Correlation value: 0.9970208654714723 - Correlated feature: 479, Correlation value: 0.8821375106262789 Column: 208 - Correlated feature: 344, Correlation value: 0.9636875346683675 - Correlated feature: 480, Correlation value: 0.8005403729573214 Column: 210 - 
Correlated feature: 348, Correlation value: 0.9497334829668724 Column: 211 - Correlated feature: 349, Correlation value: 0.9886758542873774 Column: 212 - Correlated feature: 350, Correlation value: 0.9935343473973235 Column: 213 - Correlated feature: 351, Correlation value: 0.9950937736383662 Column: 214 - Correlated feature: 352, Correlation value: 0.9792808190926792 Column: 215 - Correlated feature: 353, Correlation value: 0.9781164702047475 Column: 216 - Correlated feature: 354, Correlation value: 0.9709982864685331 Column: 217 - Correlated feature: 355, Correlation value: 0.9872908993404127 Column: 218 - Correlated feature: 356, Correlation value: 0.9499028434298444 - Correlated feature: 490, Correlation value: 0.9803379721769947 Column: 219 - Correlated feature: 357, Correlation value: 0.9787790032809582 - Correlated feature: 491, Correlation value: 0.9962413539123935 Column: 221 - Correlated feature: 359, Correlation value: 0.9799983961126711 - Correlated feature: 493, Correlation value: 0.9989352206061557 Column: 222 - Correlated feature: 360, Correlation value: 0.9907530381860487 - Correlated feature: 494, Correlation value: 0.9969021470697893 Column: 223 - Correlated feature: 361, Correlation value: 0.9788025881044045 - Correlated feature: 495, Correlation value: 0.996676925775887 Column: 224 - Correlated feature: 362, Correlation value: 0.9957100391941062 - Correlated feature: 496, Correlation value: 0.8194134887715302 Column: 225 - Correlated feature: 363, Correlation value: 0.9634701123157071 - Correlated feature: 497, Correlation value: 0.9930712184247218 Column: 227 - Correlated feature: 365, Correlation value: 0.9676296259248571 Column: 228 - Correlated feature: 366, Correlation value: 0.968277262861418 Column: 238 - Correlated feature: 376, Correlation value: 0.9658610721174306 Column: 239 - Correlated feature: 377, Correlation value: 0.9544868890306727 Column: 248 - Correlated feature: 386, Correlation value: 0.9980170919646142 - Correlated 
feature: 520, Correlation value: 0.999731811527691 Column: 250 - Correlated feature: 388, Correlation value: 0.9746150086226992 - Correlated feature: 522, Correlation value: 0.9859708058369512 Column: 251 - Correlated feature: 389, Correlation value: 0.9999393815150169 - Correlated feature: 523, Correlation value: 0.9998371595428934 Column: 253 - Correlated feature: 391, Correlation value: 0.9871846062800703 - Correlated feature: 525, Correlation value: 0.9993620852658531 Column: 254 - Correlated feature: 392, Correlation value: 0.9883973534028441 - Correlated feature: 526, Correlation value: 0.9992562452131024 Column: 255 - Correlated feature: 393, Correlation value: 0.9854530337099378 - Correlated feature: 527, Correlation value: 0.9978308439843419 Column: 267 - Correlated feature: 405, Correlation value: 0.9898785503676681 - Correlated feature: 539, Correlation value: 0.998244575094897 Column: 268 - Correlated feature: 406, Correlation value: 0.9684979604492795 - Correlated feature: 540, Correlation value: 0.9998373075688736 Column: 269 - Correlated feature: 407, Correlation value: 0.9528274163070304 - Correlated feature: 541, Correlation value: 0.9706919203843701 Column: 367 Column: 368 Column: 418 Column: 419 Column: 423 Column: 432 Column: 433 Column: 438 Column: 439 Column: 460 Column: 468 Column: 472 Column: 474 Column: 476 Column: 482 Column: 483 Column: 484 Column: 485 Column: 486 Column: 487 Column: 488 Column: 489 Column: 499 Column: 500 Column: 510 Column: 511 Column: 542 Column: 543 - Correlated feature: 545, Correlation value: 0.9902529398191411 Column: 544 Column: 546 Column: 547 Column: 548 Column: 549 - Correlated feature: 552, Correlation value: 0.9953246070646439 - Correlated feature: 555, Correlation value: 0.8842852352120143 Column: 550 - Correlated feature: 553, Correlation value: 0.9801175941570109 - Correlated feature: 556, Correlation value: 0.998026273944607 Column: 551 - Correlated feature: 554, Correlation value: 0.997285432825001 - 
Correlated feature: 557, Correlation value: 0.9987443729016954 Column: 558 Column: 559 - Correlated feature: 560, Correlation value: 0.8914188658218072 - Correlated feature: 561, Correlation value: 0.9840869994551894 Column: 562 Column: 563 Column: 564 - Correlated feature: 566, Correlation value: 0.9831451962048945 - Correlated feature: 568, Correlation value: 0.9960828549317605 Column: 565 - Correlated feature: 567, Correlation value: 0.9889826223266871 - Correlated feature: 569, Correlation value: 0.9398851883504027 Column: 570 Column: 571 Column: 572 - Correlated feature: 574, Correlation value: 0.9936889370646438 - Correlated feature: 576, Correlation value: 0.9947721462418854 - Correlated feature: 577, Correlation value: 0.8637678317401384 Column: 573 - Correlated feature: 575, Correlation value: 0.9802654170338578 - Correlated feature: 577, Correlation value: 0.9578738556284137 Column: 582 Column: 583 - Correlated feature: 584, Correlation value: 0.9947709856890873 - Correlated feature: 585, Correlation value: 0.9998896745944839 Column: 586 Column: 587 - Correlated feature: 588, Correlation value: 0.9742756191414906 Column: 589 Column: Pass/Fail Column: Month Column: Day Column: Weekday Data shape after dropping correlated columns: (1567, 226) Unique features to drop: 335, 456, 187, 334, 275, 70, 289, 469, 196, 430, 320, 124, 299, 408, 123, 104, 555, 306, 471, 473, 477, 281, 174, 475, 392, 585, 455, 152, 410, 495, 442, 416, 202, 316, 286, 280, 291, 470, 359, 431, 341, 527, 541, 448, 526, 279, 65, 457, 493, 494, 140, 496, 576, 340, 343, 155, 389, 301, 204, 441, 435, 491, 412, 479, 356, 148, 274, 321, 480, 272, 339, 349, 588, 467, 294, 351, 355, 270, 406, 295, 437, 552, 60, 357, 566, 333, 560, 413, 285, 415, 36, 205, 584, 426, 324, 354, 105, 362, 366, 407, 545, 271, 304, 297, 303, 434, 127, 444, 350, 540, 317, 556, 522, 497, 377, 337, 376, 290, 454, 386, 523, 427, 300, 323, 130, 445, 425, 278, 554, 348, 344, 30, 338, 365, 154, 296, 409, 353, 388, 46, 424, 96, 
305, 27, 283, 147, 490, 561, 391, 575, 429, 411, 106, 390, 277, 318, 298, 436, 302, 363, 568, 577, 520, 393, 197, 307, 26, 308, 331, 312, 273, 417, 309, 557, 207, 252, 287, 421, 405, 553, 164, 288, 452, 319, 525, 54, 447, 428, 440, 574, 420, 165, 361, 311, 336, 524, 199, 360, 310, 569, 443, 352, 539, 453, 459, 567, 98, 446, 282, 332 Number of unique features to drop: 210
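The pairwise scan above keeps the first member of each correlated pair. A common, more compact idiom achieves the same result using the upper triangle of the absolute correlation matrix, so each pair is inspected exactly once. A sketch on a hypothetical three-feature frame (`f1`, `f2`, `f3` are illustrative names):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "f1": base,
    "f2": base * 2 + rng.normal(scale=0.01, size=200),  # near-perfectly correlated with f1
    "f3": rng.normal(size=200),                          # independent noise
})

corr = df.corr().abs()
# Keep only the upper triangle (k=1): each pair appears once and a column
# is never compared with itself. Dropping any column correlated >= 0.80
# with an *earlier* column keeps the first member of each pair.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] >= 0.80).any()]

print(to_drop)  # ['f2']
```

The result matches the notebook's policy (first feature of each pair survives), but in far fewer lines and without manually tracking processed features.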
Q.2.E. Make all relevant modifications on the data using both functional/logical reasoning/assumptions.¶
So far, we have implemented several modifications to the dataset based on our analysis:
- We identified and dropped features with more than 20% null values to ensure data integrity.
- Features with a constant value across all rows were removed, as they contribute nothing to model building.
- Continuous features with very few unique values were evaluated and removed.
- Features with predominantly zero values (>80%) were deleted, as they lack variability.
- The 'Time' variable was transformed into 'Year', 'Month', 'Day' and 'Weekday' columns. The 'Time' and 'Year' columns were then removed, as they provide no meaningful information for model building.
- We addressed multicollinearity by dropping one variable from each correlated pair.
Moving forward, we can further refine the dataset for better model building:
- Identify features with a very low coefficient of variation and drop them to reduce noise.
- Detect and treat outliers, for example by capping, so that they don't unduly influence model training.
We will now find features with low variation in the data. Since magnitude matters, we use the coefficient of variation rather than the raw standard deviation. Such columns add no value to model building and prediction.
cv = (df_signal.drop(columns=['Pass/Fail']).std() / df_signal.drop(columns=['Pass/Fail']).mean()) * 100
# Set threshold for coefficient of variation (in %)
threshold_cv = 1  # Adjust as needed
# Get columns with coefficient of variation below the threshold
# (note: columns with a negative mean have a negative CV and will also fall below it)
low_cv_columns = cv[cv < threshold_cv]
# Print low coefficient of variation columns
print("Columns with Coefficient of Variation Below Threshold (%):")
for feature, cv_value in low_cv_columns.items():
    print(f"{feature}: {cv_value}")
print("Total number of columns with Coefficient of Variation Below Threshold (%):", len(low_cv_columns))
# Drop columns with low coefficient of variation
df_signal.drop(columns=low_cv_columns.index, inplace=True)
print("Shape after dropping low coefficient of variation columns:", df_signal.shape)
Columns with Coefficient of Variation Below Threshold (%): 9: -1796.2107003686285 21: -11.149481939647643 23: -36.23678199684351 24: -971.4849514238811 37: 0.45913293308710335 38: 0.5143142843198214 55: 0.9003802930064821 56: 0.7319445470789336 57: 0.43937520411922093 75: -320.4481937747501 76: -112.10389721895426 77: -442.0552542121112 78: -348.20429682465306 80: -263.60110285481204 81: -79.84645755064969 93: -556.4937958123485 94: -596.2707287439314 100: -1668.4594882313404 101: -3043.7987517805227 103: -31.285814019013223 107: -4943.999382174064 108: -802.6196798866475 116: 0.9620318413245726 119: 0.921793688467005 121: 0.6288198378455077 129: -219.578567375626 131: 0.22490290003933688 133: 0.6494721896323363 582: 0.6805167071831172 Total number of columns with Coefficient of Variation Below Threshold (%): 29 Shape after dropping low coefficient of variation columns: (1567, 197)
Outlier treatment¶
Let's plot boxplots for a few features to confirm the existence of outliers, which the statistical summary of the data has already hinted at. We will use a capping technique that clips values to the 1st and 99th percentiles, then plot boxplots for the same features to verify the treatment's impact.
features_to_plot = df_signal.columns[:9]
# Create subplots
fig, axes = plt.subplots(1, len(features_to_plot), figsize=(15, 4))
# Iterate over each feature and create a boxplot
for i, feature in enumerate(features_to_plot):
    df_signal[feature].plot(kind='box', ax=axes[i])
    axes[i].set_title(feature)
    axes[i].set_ylabel('Values')
plt.tight_layout()
plt.show()
# Define a function to cap outliers at the 1st and 99th percentiles
def cap_outliers(df, column):
    # Calculate the 1st and 99th percentiles
    percentile_1 = df[column].quantile(0.01)
    percentile_99 = df[column].quantile(0.99)
    # Clip values outside the percentile fences
    df[column] = df[column].clip(lower=percentile_1, upper=percentile_99)
# Apply the cap_outliers function to each column in df_signal
for column in df_signal.columns:
    cap_outliers(df_signal, column)
features_to_plot = df_signal.columns[:10]
# Create subplots
fig, axes = plt.subplots(1, len(features_to_plot), figsize=(15, 4))
# Iterate over each feature and create a boxplot
for i, feature in enumerate(features_to_plot):
    df_signal[feature].plot(kind='box', ax=axes[i])
    axes[i].set_title(feature)
    axes[i].set_ylabel('Values')
plt.tight_layout()
plt.show()
Observation We were able to treat outliers to a good extent, but some features still show outliers after capping.
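If tighter control is needed, one alternative (not applied here) is IQR-based capping, which clips to the Tukey fences Q1 - 1.5*IQR and Q3 + 1.5*IQR instead of fixed percentiles. A minimal sketch, with a hypothetical `cap_outliers_iqr` helper on made-up data:

```python
import pandas as pd

def cap_outliers_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> None:
    """Clip values in `column` to the Tukey fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    df[column] = df[column].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Example on a column with one extreme outlier
demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
cap_outliers_iqr(demo, "x")
print(demo["x"].tolist())  # the 100.0 is clipped to the upper fence (7.0)
```

Unlike percentile capping, the fences here adapt to the spread of the middle 50% of the data rather than to a fixed fraction of points.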
# Let's save the column count after data preprocessing
num_columns_after = df_signal.shape[1]
print("During data preprocessing, we reduced the number of low-impact features from", num_columns_before, "to", num_columns_after)
During data preprocessing, we reduced the number of low-impact features from 592 to 197
Q.3. Data analysis & visualisation¶
Q.3.A. Perform a detailed univariate Analysis with appropriate detailed comments after each analysis¶
Let's analyze the statistical summary of the dataset
summary_stats = df_signal.describe()
print(summary_stats)
0 1 2 3 4 \
count 1567.000000 1567.000000 1567.000000 1567.000000 1567.000000
mean 3014.352520 2495.672110 2200.692253 1392.922725 1.367414
std 71.123887 76.940041 27.651479 418.700359 0.512210
min 2852.010000 2272.514800 2124.844400 867.302700 0.753100
25% 2966.665000 2452.885000 2181.099950 1083.885800 1.017700
50% 3011.840000 2498.910000 2200.955600 1287.353800 1.317100
75% 3056.540000 2538.745000 2218.055500 1590.169900 1.529600
max 3225.563800 2717.159200 2269.255600 2993.312984 4.197013
6 7 8 10 11 ... \
count 1567.000000 1567.000000 1567.000000 1567.000000 1567.000000 ...
mean 101.088216 0.122417 1.463046 0.000106 0.964623 ...
std 6.061338 0.001924 0.072505 0.008880 0.009218 ...
min 83.822200 0.117200 1.294410 -0.023140 0.943366 ...
25% 97.937800 0.121100 1.411250 -0.005600 0.958100 ...
50% 101.492200 0.122400 1.461600 0.000400 0.965800 ...
75% 104.530000 0.123800 1.516850 0.005900 0.971300 ...
max 119.354400 0.126800 1.617050 0.022536 0.980434 ...
572 573 583 586 587 \
count 1567.000000 1567.000000 1567.000000 1567.000000 1567.000000
mean 28.395781 0.339381 0.014734 0.021232 0.016355
std 86.028444 0.208919 0.004973 0.011180 0.008201
min 5.320000 0.123300 0.007800 -0.003400 0.004800
25% 7.500000 0.242250 0.011600 0.013450 0.010600
50% 8.650000 0.293400 0.013800 0.020500 0.014800
75% 10.130000 0.366900 0.016500 0.027600 0.020300
max 439.050000 1.330018 0.039336 0.054738 0.047410
589 Pass/Fail Month Day Weekday
count 1567.000000 1567.000000 1567.000000 1567.000000 1567.000000
mean 98.563043 -0.867262 7.409700 17.248883 3.171666
std 88.201686 0.498010 2.554511 7.613716 1.988605
min 0.000000 -1.000000 1.000000 8.000000 0.000000
25% 44.368600 -1.000000 7.000000 10.000000 1.000000
50% 72.023000 -1.000000 8.000000 17.000000 3.000000
75% 114.749700 -1.000000 9.000000 23.000000 5.000000
max 474.081200 1.000000 12.000000 31.000000 6.000000
[8 rows x 197 columns]
Inference (statistical summary)
- Many features have outliers
- The Pass/Fail data is imbalanced
- Most of the data falls in months 7, 8, and 9
Histplot
df_signal.hist(figsize=(20, 30), bins=12)
plt.show()
Inference
- Many columns are left- or right-skewed, indicating the existence of outliers
- Many features have only a few unique values
- A few columns are approximately normally distributed
Boxplot
# Let's plot boxplots for all columns
num_cols = 8
# Calculate the number of rows needed
num_features = len(df_signal.columns)
num_rows = (num_features - 1) // num_cols + 1
# Create subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 2.5*num_rows))
# Flatten the axes array to iterate over them
axes = axes.flatten()
# Iterate over each feature and create a boxplot
for i, feature in enumerate(df_signal.columns):
    ax = axes[i]  # Get the current axis
    df_signal[feature].plot(kind='box', ax=ax)
    ax.set_title(feature)
    ax.set_ylabel('Values')
# Hide any empty subplots
for j in range(num_features, num_rows*num_cols):
    axes[j].axis('off')
plt.tight_layout()
plt.show()
Inference (boxplots)
- Many columns have outliers on both sides of the distribution
- A few plots are compressed to a small size, indicating low variation
- A few columns are approximately normally distributed
Piechart (Pass/Fail)
df_signal['Pass/Fail'].value_counts().plot.pie(autopct='%1.1f%%')
plt.show()
Inference
The Pass/Fail distribution in the pie chart above clearly shows the class imbalance.
Q.3.B. Perform bivariate and multivariate analysis with appropriate detailed comments after each analysis¶
Correlation with Target variable¶
It won't be practical to plot a bivariate chart for every feature. Since we intend to find the correlation of features with the target variable, we will pick the top few features that correlate best with Pass/Fail and visualize their relationship with it using various plots.
ScatterPlot
# Compute the correlation matrix
correlation_matrix = df_signal.corr()
# Find the top 12 pairs of features with the highest absolute correlation
max_corr_pairs = correlation_matrix.abs().unstack().sort_values(ascending=False)
max_corr_pairs = max_corr_pairs[max_corr_pairs < 1] # Exclude self-correlation
# Initialize a set to store unique pairs
unique_pairs = set()
# Select the top 12 pairs of features
top_12_pairs = []
for (feature_1, feature_2), correlation in max_corr_pairs.items():
    # Check if the pair or its reverse is already included in the set
    if (feature_1, feature_2) not in unique_pairs and (feature_2, feature_1) not in unique_pairs:
        top_12_pairs.append((feature_1, feature_2))
        unique_pairs.add((feature_1, feature_2))
# Limit to the top 12 pairs
top_12_pairs = top_12_pairs[:12]
# Plot scatterplots with hue as Pass/Fail for each pair in a 3x4 matrix
fig, axes = plt.subplots(nrows=3, ncols=4, figsize=(20, 14))
for (feature_1, feature_2), ax in zip(top_12_pairs, axes.flatten()):
    correlation_percentage = correlation_matrix.loc[feature_1, feature_2] * 100
    sns.scatterplot(x=feature_1, y=feature_2, hue='Pass/Fail', data=df_signal, ax=ax, palette='coolwarm')
    ax.set_title(f'{feature_1} vs {feature_2} (Correlation: {correlation_percentage:.2f}%) with Hue as Pass/Fail')
    ax.set_xlabel(feature_1)
    ax.set_ylabel(feature_2)
# Adjust layout
plt.tight_layout()
plt.show()
Observations: Although we removed multicollinearity (correlation > 80%) by dropping correlated columns, some correlation has reappeared because of the outlier treatment. Capping values at the 99th percentile has increased the correlation between some features, which is why a few scatterplots above show correlations above 90% even though pairs correlated above 90% had already been removed.
We can see both positive and negative correlations between many features; the top 12 pairs are shown above.
Barplot
correlations = df_signal.corr()['Pass/Fail'].sort_values(ascending=False)
correlations = correlations.drop('Pass/Fail', axis=0)
top_20_features = correlations.abs().nlargest(20).index.tolist()
# Set up the figure and axes for bar plots
fig, axes = plt.subplots(nrows=4, ncols=6, figsize=(18, 13))
# Plot bar plots for top features by Pass/Fail
for i, feature in enumerate(top_20_features):
    ax = axes[i//6, i%6]  # Adjust the grid position
    sns.barplot(x='Pass/Fail', y=feature, data=df_signal, ax=ax, estimator='mean', linewidth=0.5)  # Reduce linewidth for smaller bars
    ax.set_title(f'Bar Plot of {feature} by Pass/Fail')
    ax.set_xlabel('Pass/Fail')
    ax.set_ylabel('Mean Value')
# Adjust layout
plt.tight_layout()
plt.show()
Observations (barplot)
- The barplots above show the distribution of the top 20 features most correlated with Pass/Fail, split by the target variable.
- Most columns show no significant preference for either class.
- A few columns show a clear inclination towards Fail over Pass.
Violin Plot
fig, axes = plt.subplots(nrows=4, ncols=5, figsize=(15, 12))
# Plot violin plots for top features by Pass/Fail
for i, feature in enumerate(top_20_features):
    ax = axes[i//5, i%5]  # Adjust the grid position
    sns.violinplot(x='Pass/Fail', y=feature, data=df_signal, ax=ax)
    ax.set_title(f'Violin Plot of {feature} by Pass/Fail')
    ax.set_xlabel('Pass/Fail')
    ax.set_ylabel('Feature Value')
# Adjust layout
plt.tight_layout()
plt.show()
Observations (violin plot)
- The violin plots above show the distribution of the top 20 features most correlated with Pass/Fail, split by the target variable.
- Most columns show no significant inclination towards a specific class (Pass or Fail).
- A few columns show a clear inclination towards Fail over Pass.
- Some violins are multi-modal, suggesting possible clusters in the data.
Heatmap (Top 30 correlated features)
#Plot for high correlated features
correlation_matrix = df_signal.corr()
# Flatten the absolute correlation matrix and sort it to rank feature pairs
correlation_values = correlation_matrix.abs().unstack()
sorted_correlation = correlation_values.sort_values(ascending=False)
# Select the top 30 correlated feature pairs (excluding self-correlation)
top_30_correlated = sorted_correlation[sorted_correlation != 1][:30]
# Extract the names of the top 30 correlated features
top_30_features = [(pair[0], pair[1]) for pair in top_30_correlated.index]
# Create a subset dataframe containing only the top 30 correlated features
df_top_30 = df_signal[[feature[0] for feature in top_30_features] + [feature[1] for feature in top_30_features]]
# Compute the correlation matrix for the top 30 features
correlation_matrix_top_30 = df_top_30.corr()
# Plot the heatmap
plt.figure(figsize=(15, 12))
sns.heatmap(correlation_matrix_top_30, cmap='rocket', fmt=".2f")
plt.title('Heatmap of Top 30 Correlated Features')
plt.show()
Observations (heatmap)
- Again, it would be impractical to visualize a heatmap of all variables due to the large number of columns, so we extracted the 30 most highly correlated features for the heatmap.
- A few columns are clearly highly correlated. Although multicollinearity was reduced during data treatment, some of it was reintroduced by outlier capping.
- The small diagonals on either side of the main diagonal indicate highly correlated feature pairs.
Q.4. Data pre-processing¶
Q.4.A. Segregate predictors vs target attributes
# Separate the target variable from the features
X = df_signal.drop(columns=['Pass/Fail'])
y = df_signal['Pass/Fail']
print("Shape of the X data", X.shape)
print("Shape of the y data", y.shape)
Shape of the X data (1567, 196) Shape of the y data (1567,)
Q.4.B Check for target balancing and fix it if found imbalanced.¶
# Check the distribution of the target variable before balancing
target_counts = y.value_counts()
print(target_counts)
plt.figure(figsize=(8, 6))
plt.bar(target_counts.index, target_counts.values)
plt.title('Distribution of Target Variable (Before Balancing)')
plt.xlabel('Target Classes')
plt.ylabel('Counts')
plt.xticks(target_counts.index, ['Pass(-1)', 'Fail(1)'])
plt.show()
Pass/Fail -1 1463 1 104 Name: count, dtype: int64
print("Target variable is imbalanced.")
# Apply SMOTE to balance the target variable
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)
# Check the distribution of the resampled target variable
target_counts_resampled = y_resampled.value_counts()
print("Target variable after SMOTE:")
print(target_counts_resampled)
Target variable is imbalanced. Target variable after SMOTE: Pass/Fail -1 1463 1 1463 Name: count, dtype: int64
Q.4.C. Perform train-test split and standardise the data or vice versa if required.¶
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.25, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_scaled = scaler.fit_transform(X_resampled)
print(X_train.shape)
print(X_train_scaled.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(2194, 196) (2194, 196) (2194,) (732, 196) (732,)
We have split the data and standardized it using StandardScaler.
Q.4.D. Check if the train and test data have similar statistical characteristics when compared with original data.¶
We can check the statistical summary for the original dataframe and the training and testing datasets. However, it would be impractical to visualize the differences for all features, so we will print the statistics of a few columns side by side and manually check for similarities.
# Calculate summary statistics for the original data
summary_stats_original = df_signal.describe()
# Calculate summary statistics for the oversampled data
summary_stats_resampled = pd.DataFrame(X_resampled).describe()
# Calculate summary statistics for the training data
summary_stats_train = pd.DataFrame(X_train).describe()
# Calculate summary statistics for the testing data
summary_stats_test = pd.DataFrame(X_test).describe()
print("Original Data Shape:")
print(df_signal.shape)
print("\nResampled X Data Shape:")
print(X_resampled.shape)
print("\nTraining Data Shape:")
print(X_train.shape)
print("\nTesting Data Shape:")
print(X_test.shape)
from tabulate import tabulate
features = df_signal.columns[:4]
# Loop through the features and print statistics for each
for feature in features:
    # Get describe statistics for the current feature from all datasets
    original_stats = df_signal[feature].describe()
    train_stats = X_train[feature].describe()
    test_stats = X_test[feature].describe()
    # Convert describe statistics to a list of lists for tabulate
    stats_data = [
        ["Original Data", *original_stats.values],
        ["Train Data", *train_stats.values],
        ["Test Data", *test_stats.values]
    ]
    # Print the tabulated statistics for the current feature
    print(f"Statistics for feature '{feature}':")
    print(tabulate(stats_data, headers=["Dataset", *original_stats.index.tolist()]))
    print("\n")  # Add a newline for separation between tables
Original Data Shape: (1567, 197) Resampled X Data Shape: (2926, 196) Training Data Shape: (2194, 196) Testing Data Shape: (732, 196) Statistics for feature '0': Dataset count mean std min 25% 50% 75% max ------------- ------- ------- ------- ------- ------- ------- ------- ------- Original Data 1567 3014.35 71.1239 2852.01 2966.66 3011.84 3056.54 3225.56 Train Data 2194 3011.59 70.1086 2852.01 2963.53 3004.01 3052.76 3225.56 Test Data 732 3007.59 67.2862 2852.01 2962.95 2997.84 3047.45 3225.56 Statistics for feature '1': Dataset count mean std min 25% 50% 75% max ------------- ------- ------- ------- ------- ------- ------- ------- ------- Original Data 1567 2495.67 76.94 2272.51 2452.89 2498.91 2538.74 2717.16 Train Data 2194 2495.98 67.1023 2272.51 2457.49 2498.31 2533.7 2717.16 Test Data 732 2493.02 71.8118 2272.51 2456.54 2499.85 2531.68 2717.16 Statistics for feature '2': Dataset count mean std min 25% 50% 75% max ------------- ------- ------- ------- ------- ------- ------- ------- ------- Original Data 1567 2200.69 27.6515 2124.84 2181.1 2200.96 2218.06 2269.26 Train Data 2194 2200.76 25.7693 2124.84 2181.62 2199.66 2217.1 2269.26 Test Data 732 2200.17 25.1466 2124.84 2183.46 2199.69 2217.04 2269.26 Statistics for feature '3': Dataset count mean std min 25% 50% 75% max ------------- ------- ------- ------- ------- ------- ------- ------- ------- Original Data 1567 1392.92 418.7 867.303 1083.89 1287.35 1590.17 2993.31 Train Data 2194 1370.61 360.575 867.303 1101.8 1293.35 1551.69 2993.31 Test Data 732 1383.38 356.718 867.303 1120.56 1309.7 1553.32 2993.31
Observations
Looking at the above comparison of the statistical summaries of the first four columns, we can infer that the statistics are very similar, if not identical.
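This side-by-side inspection can be complemented with a two-sample Kolmogorov-Smirnov test per feature: a large p-value means we cannot reject that the two samples come from the same distribution. A sketch with stand-in data (the arrays here are synthetic, not the actual splits):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
full = rng.normal(loc=3014, scale=71, size=1567)    # stand-in for feature '0'
train = rng.choice(full, size=1175, replace=False)  # stand-in for the train split

# KS test compares the two empirical CDFs; a subsample of the same data
# should yield a small statistic and a large p-value
stat, p_value = ks_2samp(full, train)
print(f"KS statistic={stat:.3f}, p-value={p_value:.3f}")
```

In practice one would loop this over `df_signal`, `X_train`, and `X_test` column by column and flag any feature whose p-value falls below a chosen significance level.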
Q.5. Model training, testing and tuning¶
Goal Statement¶
We have a dataset of semiconductor process signals along with a 'Pass/Fail' status from in-house line testing. The target variable takes the values -1 (Pass) and 1 (Fail). In this case, failing to identify a 1 (Fail) is costly, so we need to minimize false negatives for class 1. We will therefore focus on maximizing the recall score for class 1.
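Concretely, the metric we care about is recall for the Fail class alone, not the weighted average; with scikit-learn this is `recall_score` with `pos_label=1`. A small illustration with made-up labels:

```python
from sklearn.metrics import recall_score

y_true = [-1, -1, -1, 1, 1, 1, 1]   # ground truth: 4 actual Fails
y_pred = [-1, -1,  1, 1, 1, 1, -1]  # model misses one Fail (a false negative)

fail_recall = recall_score(y_true, y_pred, pos_label=1)
print(fail_recall)  # 3 of 4 actual Fails caught -> 0.75
```

This is the number the classification reports below expose in the `recall` column of the `1` row.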
Reusable Common functions¶
# Let's define a function to store the result of each model/combination in the dataframe results_df
columns = ['Model','train_acc','test_acc','train_recall','test_recall','train_precision','test_precision','Train_F1','Test_F1','KFold_score','SKF_score']
results_df = pd.DataFrame(columns=columns)

def AddModelResults(df, Model, train_acc, test_acc, train_recall, test_recall, train_precision, test_precision, Train_F1, Test_F1, KFold_score=None, SKF_score=None):
    if (df['Model'] == Model).any():
        # Model already present: update its row in place
        df.loc[df['Model'] == Model, columns] = [Model, train_acc, test_acc, train_recall, test_recall, train_precision, test_precision, Train_F1, Test_F1, KFold_score, SKF_score]
    else:
        # Append a new row
        new_row = {'Model': Model, 'train_acc': train_acc, 'test_acc': test_acc, 'train_recall': train_recall, 'test_recall': test_recall, 'train_precision': train_precision, 'test_precision': test_precision, 'Train_F1': Train_F1, 'Test_F1': Test_F1, 'KFold_score': KFold_score, 'SKF_score': SKF_score}
        df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
    return df
def UpdateKFoldSKFScores(df, Model, KFold_score, SKF_score):
    """
    Updates the KFold_score and SKF_score for a given model in the DataFrame.

    Parameters:
    - df: pandas.DataFrame - The DataFrame containing the model results.
    - Model: str - The name of the model to update.
    - KFold_score: float - The KFold cross-validation score to update.
    - SKF_score: float - The Stratified KFold cross-validation score to update.

    Returns:
    - df: pandas.DataFrame - The updated DataFrame.
    """
    if (df['Model'] == Model).any():
        # Model exists, update the KFold_score and SKF_score
        df.loc[df['Model'] == Model, ['KFold_score', 'SKF_score']] = [KFold_score, SKF_score]
    else:
        # Model does not exist, warn the user
        print(f"Warning: Model '{Model}' not found in the DataFrame. Consider adding the model first.")
    return df
# Function used to compute, print, and save a model's output metrics
def PrintOutput(dfr, name, Xtrain, Xtest, ytrain, ytest, pred_train, pred_test, ShowClassification=None, KFold_score=None, SKF_score=None):
    if ShowClassification is None:
        ShowClassification = True
    train_acc = np.round(accuracy_score(ytrain, pred_train), 2)
    test_acc = np.round(accuracy_score(ytest, pred_test), 2)
    train_recall = np.round(recall_score(ytrain, pred_train, average='weighted'), 2)
    test_recall = np.round(recall_score(ytest, pred_test, average='weighted'), 2)
    train_precision = np.round(precision_score(ytrain, pred_train, average='weighted'), 2)
    test_precision = np.round(precision_score(ytest, pred_test, average='weighted'), 2)
    train_f1 = np.round(f1_score(ytrain, pred_train, average='weighted'), 2)
    test_f1 = np.round(f1_score(ytest, pred_test, average='weighted'), 2)
    classification_rep = classification_report(ytest, pred_test)
    print('*'*15, name, ' Output Metrics', '*'*15)
    print("Accuracy on training set : ", train_acc)
    print("Accuracy on test set : ", test_acc)
    print("Recall on training set: ", train_recall)
    print("Recall on test set: ", test_recall)
    print("Precision on training set: ", train_precision)
    print("Precision on test set: ", test_precision)
    print("F1 on train set: ", train_f1)
    print("F1 on test set: ", test_f1)
    if ShowClassification:
        print("Classification Report on test data:")
        print(classification_rep)
    return train_acc, train_recall, train_precision, train_f1, test_acc, test_recall, test_precision, test_f1, AddModelResults(dfr, name, train_acc, test_acc, train_recall, test_recall, train_precision, test_precision, train_f1, test_f1, KFold_score, SKF_score)
Q.5.A. Use any Supervised Learning technique to train a model¶
Let's start with logistic regression
# Let's start with logistic regression
# Define and train the logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)
y_pred_trainLr = log_reg.predict(X_train_scaled)
y_pred_testLr = log_reg.predict(X_test_scaled)
accuracyLr, recallLr, precisionLr, f1Lr, accuracy_testLr, recall_testLr, precision_testLr, f1_testLr, results_df = PrintOutput(results_df, 'Logistics Regression', X_train_scaled, X_test_scaled, y_train, y_test, y_pred_trainLr, y_pred_testLr)
*************** Logistics Regression Output Metrics ***************
Accuracy on training set : 0.94
Accuracy on test set : 0.89
Recall on training set: 0.94
Recall on test set: 0.89
Precision on training set: 0.95
Precision on test set: 0.9
F1 on train set: 0.94
F1 on test set: 0.89
Classification Report on test data:
precision recall f1-score support
-1 0.96 0.82 0.89 371
1 0.84 0.96 0.90 361
accuracy 0.89 732
macro avg 0.90 0.89 0.89 732
weighted avg 0.90 0.89 0.89 732
Let's now try Random Forest
rf = RandomForestClassifier()
rf.fit(X_train_scaled, y_train)
y_pred_train = rf.predict(X_train_scaled)
y_pred_test = rf.predict(X_test_scaled)
accuracy_rf, recall_rf, precision_rf, f1, accuracy_test_rf, recall_test_rf, precision_test_rf, f1_test_rf, results_df = PrintOutput(results_df, 'Random Forest', X_train_scaled, X_test_scaled, y_train, y_test, y_pred_train, y_pred_test)
*************** Random Forest Output Metrics ***************
Accuracy on training set : 1.0
Accuracy on test set : 0.99
Recall on training set: 1.0
Recall on test set: 0.99
Precision on training set: 1.0
Precision on test set: 0.99
F1 on train set: 1.0
F1 on test set: 0.99
Classification Report on test data:
precision recall f1-score support
-1 0.99 0.99 0.99 371
1 0.99 0.99 0.99 361
accuracy 0.99 732
macro avg 0.99 0.99 0.99 732
weighted avg 0.99 0.99 0.99 732
Q.5.B. Use cross validation techniques.¶
def perform_cross_validation(model, results_df, model_name, X, y, cv=None):
    """
    Perform KFold and Stratified KFold (SKF) cross-validation for a given model
    and update the results dataframe with the scores.

    Parameters:
    - model: The machine learning model to be evaluated.
    - results_df: Results dataframe to store the scores.
    - model_name: Name of the model.
    - X: The feature matrix.
    - y: The target vector.
    - cv: Number of folds for cross-validation. Default is 10.

    Returns:
    - results_df: Updated results dataframe with scores.
    """
    if cv is None:
        cv = 10
    print(f"---------------------KFold Cross-validation for {model_name}----------------------")
    kfold_scores = cross_val_score(model, X, y, cv=cv)
    kfold_mean_score = kfold_scores.mean()
    print(f"Average Kfold cross-validation score {model_name}:", kfold_mean_score)
    print(f"---------------------SKF Cross-validation for {model_name}----------------------")
    skf = StratifiedKFold(n_splits=cv)
    skf_scores = cross_val_score(model, X, y, cv=skf)
    skf_mean_score = skf_scores.mean()
    print(f"Average skf cross-validation score for {model_name}:", skf_mean_score)
    if results_df is not None and model_name is not None:
        results_df = UpdateKFoldSKFScores(results_df, model_name, kfold_mean_score, skf_mean_score)
    return results_df
We will use KFold and Stratified KFold (SKF) cross-validation. First we run cross-validation directly on the already transformed data; afterwards we transform the data inside a manual cross-validation loop.
results_df = perform_cross_validation(log_reg,results_df,'Logistics Regression', X_train_scaled, y_train, cv=5)
results_df = perform_cross_validation(rf,results_df, 'Random Forest', X_train_scaled, y_train, cv=5)
---------------------KFold Cross-validation for Logistics Regression---------------------- Average Kfold cross-validation score Logistics Regression: 0.899267742170354 ---------------------SKF Cross-validation for Logistics Regression---------------------- Average skf cross-validation score for Logistics Regression: 0.899267742170354 ---------------------KFold Cross-validation for Random Forest---------------------- Average Kfold cross-validation score Random Forest: 0.9840411478973591 ---------------------SKF Cross-validation for Random Forest---------------------- Average skf cross-validation score for Random Forest: 0.9831351868609646
# Let's try SKF cross-validation manually with a for loop, transforming the data per fold
# Define the number of folds for cross-validation
n_splits = 5
# Initialize StratifiedKFold for cross-validation
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
# Initialize StandardScaler for scaling features
scaler = StandardScaler()
# Initialize SMOTE for oversampling the minority class
smote = SMOTE()
# Initialize lists to store evaluation scores
cross_val_scores = []
# Perform cross-validation
for train_index, test_index in skf.split(X, y):
    # Split data into train and test sets
    X_train1, X_test1 = X.iloc[train_index], X.iloc[test_index]
    y_train1, y_test1 = y.iloc[train_index], y.iloc[test_index]
    # Apply scaling to training and test sets (fit on training only)
    X_train_scaled1 = scaler.fit_transform(X_train1)
    X_test_scaled1 = scaler.transform(X_test1)
    # Apply SMOTE to balance the training set only
    X_train_resampled1, y_train_resampled1 = smote.fit_resample(X_train_scaled1, y_train1)
    # Train the classifier on the resampled training data
    log_reg.fit(X_train_resampled1, y_train_resampled1)
    # Evaluate the classifier on the untouched test fold
    score = log_reg.score(X_test_scaled1, y_test1)
    cross_val_scores.append(score)
# Calculate and print the average cross-validation score
average_score = sum(cross_val_scores) / len(cross_val_scores)
print("Average cross-validation score:", average_score)
Average cross-validation score: 0.8257809161392726
Cross-validation conclusion The Random Forest model performs well under cross-validation, so we will continue with it for the remaining steps, i.e. hyperparameter tuning and PCA.
Q.5.C. Apply hyper-parameter tuning techniques to get the best accuracy.¶
# Define the parameter grid for grid search
param_grid = {
'n_estimators': [50, 100],
'max_depth': [10, 20],
}
Rm_Fst = RandomForestClassifier()
# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=Rm_Fst, param_grid=param_grid, cv=5)
# Fit the grid search object to the training data
grid_search.fit(X_train_scaled, y_train)
# Get the best parameters
best_params = grid_search.best_params_
# Print the best parameters
print("Best parameters:", best_params)
# Use the best Random Forest estimator found by the grid search
best_rf_classifier = grid_search.best_estimator_
# Train the best classifier (GridSearchCV already refits on the full training set, so this is a safeguard)
best_rf_classifier.fit(X_train_scaled, y_train)
# Predict the class labels for the test data
y_pred_trainRfGv = best_rf_classifier.predict(X_train_scaled)
y_pred_testRfGv = best_rf_classifier.predict(X_test_scaled)
# Print the performance metrics
accuracy_rf_tuned, recall_rf_tuned, precision_rf_tuned, f1, accuracy_test_rf_tuned, recall_test_rf_tuned, precision_test_rf_tuned, f1_test_rf_tuned, results_df = PrintOutput(results_df, 'Random Forest (Tuned) model', X_train_scaled, X_test_scaled, y_train, y_test, y_pred_trainRfGv, y_pred_testRfGv)
Best parameters: {'max_depth': 20, 'n_estimators': 100}
*************** Random Forest (Tuned) model Output Metrics ***************
Accuracy on training set : 1.0
Accuracy on test set : 0.99
Recall on training set: 1.0
Recall on test set: 0.99
Precision on training set: 1.0
Precision on test set: 0.99
F1 on train set: 1.0
F1 on test set: 0.99
Classification Report on test data:
precision recall f1-score support
-1 0.99 0.99 0.99 371
1 0.99 0.99 0.99 361
accuracy 0.99 732
macro avg 0.99 0.99 0.99 732
weighted avg 0.99 0.99 0.99 732
Q.5.D Use any other technique/method which can enhance the model performance¶
- As the data has a large number of columns, it is worth checking performance after dimensionality reduction.
- We need to ensure that we don't lose information while reducing dimensions, so we won't use forward or backward selection techniques, which keep only a subset of the features.
- Hence we will use PCA, which reduces the number of features while retaining most of the information.
Note:
- We have already scaled and balanced the data (X_scaled), so we will reuse it for PCA.
- We will not standardize the data again, as PCA will be applied to the already scaled data.
- We will use the X data obtained from PCA along with the balanced y data for the further split into train and test sets.
- Since X_scaled and y_resampled are already balanced and scaled, there is no need to balance/scale again before or after PCA.
# prompt: lets reduce dimension using PCA dimensionality reduction technique
# Perform PCA with 65 components
pca = PCA(n_components=65)
X_pca = pca.fit_transform(X_scaled)
# Split the PCA-transformed data into training and testing sets
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca, y_resampled, test_size=0.25, random_state=42)
# Train a Random Forest classifier on the PCA-transformed, scaled data
rf_classifier_pca = RandomForestClassifier()
rf_classifier_pca.fit(X_train_pca, y_train_pca)
y_pred_trainPca = rf_classifier_pca.predict(X_train_pca)
y_pred_testPca = rf_classifier_pca.predict(X_test_pca)
# Print the performance metrics
accuracy_svm_pca, precision_svm_pca, recall_svm_pca, f1,accuracy_test_svm_pca, precision_test_svm_pca, recall_test_svm_pca, f1_test_svm_pca, results_df = PrintOutput(results_df,'Random Forest model with PCA',X_train_pca, X_test_pca, y_train_pca, y_test_pca,y_pred_trainPca, y_pred_testPca)
*************** Random Forest model with PCA Output Metrics ***************
Accuracy on training set : 1.0
Accuracy on test set : 0.98
Recall on training set: 1.0
Recall on test set: 0.98
Precision on training set: 1.0
Precision on test set: 0.98
F1 on train set: 1.0
F1 on test set: 0.98
Classification Report on test data:
precision recall f1-score support
-1 0.98 0.98 0.98 371
1 0.98 0.98 0.98 361
accuracy 0.98 732
macro avg 0.98 0.98 0.98 732
weighted avg 0.98 0.98 0.98 732
#Cross validation for above model
results_df = perform_cross_validation(rf_classifier_pca,results_df, 'Random Forest model with PCA', X_train_pca, y_train_pca, cv=4)
---------------------KFold Cross-validation for Random Forest model with PCA----------------------
Average Kfold cross-validation score Random Forest model with PCA: 0.9813130708787046
---------------------SKF Cross-validation for Random Forest model with PCA----------------------
Average skf cross-validation score for Random Forest model with PCA: 0.9799411338465425
Let's now check how much of the total variance is explained by the number of components we selected for PCA
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
# Plot your data on the first subplot
ax1.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         np.cumsum(pca.explained_variance_ratio_), marker='o', color='green')
ax1.set_xlabel('Number of Components')
ax1.set_ylabel('Cumulative Variance Explained')
ax1.set_title('Cumulative Variance Explained with Number of Components')
# Draw a horizontal line at 90% cumulative variance explained
ax1.axhline(y=0.9, color='red', linestyle='--')
ax1.grid(True)
# Plot your data on the second subplot and present in steps
ax2.bar(list(range(1, len(pca.explained_variance_ratio_) + 1)),pca.explained_variance_ratio_,alpha=0.5, align='center',color='blue')
ax2.step(list(range(1, len(pca.explained_variance_ratio_) + 1)),np.cumsum(pca.explained_variance_ratio_), where='mid',color='blue')
ax2.set_title('Cumulative Variance Explained with steps')
ax2.set_ylabel('Variation explained')
ax2.set_xlabel('# of PCA Components')
# Draw a horizontal line at 90% cumulative variance explained
ax2.axhline(y=0.9, color='red', linestyle='--')
ax2.grid(True)
plt.tight_layout() # Adjust layout to prevent overlapping
plt.show()
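The 90% reference line drawn on both axes corresponds to a simple cumulative-sum threshold on `pca.explained_variance_ratio_`. A small sketch with made-up ratios (the fitted `pca` object itself is not reproduced here):

```python
from itertools import accumulate

def components_for_variance(ratios, threshold=0.90):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches the threshold."""
    for n, total in enumerate(accumulate(ratios), start=1):
        if total >= threshold:
            return n
    return len(ratios)  # threshold never reached

# Illustrative ratios; a real pca.explained_variance_ratio_ sums to at most 1.0.
ratios = [0.40, 0.25, 0.15, 0.08, 0.05, 0.04, 0.02, 0.01]
print(components_for_variance(ratios, 0.90))  # 5 (cumulative sum first reaches 0.93)
```

Running the same check on the real fitted `pca`'s `explained_variance_ratio_` would show whether 65 components was a comfortable choice relative to the 90% line.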
Let's now apply hyperparameter tuning to the Random Forest model trained on the PCA data
param_grid = {
    'n_estimators': [100, 150],
    'max_depth': [10, 20],
    'min_samples_split': [2],
    'min_samples_leaf': [1, 2]
}
rf_classifier = RandomForestClassifier()
# Create a GridSearchCV object
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5)
# Fit the grid search object to the training data
grid_search.fit(X_train_pca, y_train_pca)
# Get the best parameters
best_params = grid_search.best_params_
# Print the best parameters
print("Best parameters:", best_params)
# Create a new Random Forest classifier with the best parameters
# best_rf_classifier = RandomForestClassifier(**best_params)
best_rf_classifier = grid_search.best_estimator_
# best_estimator_ is already refit on the training data, so this fit is optional
best_rf_classifier.fit(X_train_pca, y_train_pca)
# Predict the class labels for the test data
y_pred_trainRfPca = best_rf_classifier.predict(X_train_pca)
y_pred_testRfPca = best_rf_classifier.predict(X_test_pca)
# Print the performance metrics
accuracy_rf_pca, precision_rf_pca, recall_rf_pca, f1,accuracy_test_rf_pca, precision_test_rf_pca, recall_test_rf_pca, f1_test_rf_pca, results_df = PrintOutput(results_df,'Random Forest (Tuned) model PCA',X_train_pca, X_test_pca,y_train_pca, y_test_pca,y_pred_trainRfPca, y_pred_testRfPca, False)
Best parameters: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
*************** Random Forest (Tuned) model PCA Output Metrics ***************
Accuracy on training set : 1.0
Accuracy on test set : 0.97
Recall on training set: 1.0
Recall on test set: 0.97
Precision on training set: 1.0
Precision on test set: 0.97
F1 on train set: 1.0
F1 on test set: 0.97
results_df = perform_cross_validation(best_rf_classifier,results_df, 'Random Forest (Tuned) model PCA', X_train_pca, y_train_pca, cv=3)
---------------------KFold Cross-validation for Random Forest (Tuned) model PCA----------------------
Average Kfold cross-validation score Random Forest (Tuned) model PCA: 0.9762969109361879
---------------------SKF Cross-validation for Random Forest (Tuned) model PCA----------------------
Average skf cross-validation score for Random Forest (Tuned) model PCA: 0.9767553990715615
Observations (PCA)
- Using PCA, we transformed the features and reduced their number significantly.
- Models trained on the reduced data are expected to consume less time and fewer resources.
- Importantly, performance has not dropped on the PCA data, as can be seen from both the metrics and the cross-validation scores.
Q.5.E. Display and explain the classification report in detail.¶
# Print the performance metrics
accuracy_rf_pca, precision_rf_pca, recall_rf_pca, f1,accuracy_test_rf_pca, precision_test_rf_pca, recall_test_rf_pca, f1_test_rf_pca, results_df = PrintOutput(results_df,'Random Forest (Tuned) model PCA',X_train_pca, X_test_pca,y_train_pca, y_test_pca,y_pred_trainRfPca, y_pred_testRfPca)
*************** Random Forest (Tuned) model PCA Output Metrics ***************
Accuracy on training set : 1.0
Accuracy on test set : 0.97
Recall on training set: 1.0
Recall on test set: 0.97
Precision on training set: 1.0
Precision on test set: 0.97
F1 on train set: 1.0
F1 on test set: 0.97
Classification Report on test data:
precision recall f1-score support
-1 0.98 0.97 0.97 371
1 0.97 0.98 0.97 361
accuracy 0.97 732
macro avg 0.97 0.97 0.97 732
weighted avg 0.97 0.97 0.97 732
Classification Report
- The classification report above is for the test data.
- The classification report offers a comprehensive breakdown of precision, recall, and F1-score for each class (-1 and 1). Remarkably, both classes demonstrate high precision, recall, and F1-score, indicating the model's strong performance for both positive and negative instances.
- As outlined in the project's objective, there is a specific business requirement to enhance the Recall score of class 1.
- Recall: A high recall score for a particular class signifies a low count of false negatives (FN) for that class. The recall scores for both classes are high, suggesting minimal occurrences of false negatives.
- Precision: Precision score reflects the incidence of false positives (FP). High precision is indicative of a low count of false positives. In the presented classification report, precision scores for both classes are robust, suggesting a minimal occurrence of false positives.
- F1 score: The F1-score, a harmonic mean of precision and recall, offers a balanced assessment of the model's performance. The elevated F1-scores on test sets indicate a harmonious balance between precision and recall.
- Support: Support denotes the number of samples in each class within the test set, providing crucial context for interpreting other metrics. Due to data balancing using SMOTE, a nearly equal number of samples are present for each class.
- In this scenario, where -1 represents the pass outcome in houseline testing and 1 signifies failure, the focus is naturally on class 1 and its recall. Remarkably, both recall and precision metrics fare well for both classes, suggesting minimal occurrences of false negatives and false positives for both classes.
Overall, the model demonstrates strong performance across all metrics, achieving high accuracy, recall, precision, and F1-score on both the training and test sets. This suggests that the model is effective in classifying instances from both classes and generalizes well to unseen data.
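All the per-class numbers in the report reduce to simple ratios over confusion-matrix counts. A hand computation for class 1, with illustrative counts (the actual confusion matrix is not shown above, so `tp`, `fp`, `fn` below are assumed values chosen only to be consistent with the class-1 support of 361):

```python
def per_class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from confusion-matrix counts."""
    precision = tp / (tp + fp)          # high precision => few false positives
    recall = tp / (tp + fn)             # high recall => few false negatives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Illustrative counts for class 1 (failure): tp + fn equals the support of 361.
p, r, f1 = per_class_metrics(tp=354, fp=11, fn=7)
print(round(p, 2), round(r, 2), round(f1, 2))
```

This is the same arithmetic `classification_report` performs per class; support is simply `tp + fn` for that class.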
Q.5.F. Apply the above steps for all possible models that you have learnt so far.¶
Repeating all Q.5 steps for various models
We will utilize a Pipeline to execute the following procedures for each model it contains:
- Conduct basic model training using the original balanced, scaled training data.
- Perform K-fold and SKF cross-validation on the original balanced, scaled training data.
- Display and store output metrics for the original balanced, scaled train and test data.
- Fine-tune the model's hyperparameters using GridSearchCV on the previously PCA-transformed data.
- Again display and store output metrics for the PCA-transformed train and test data.
# Define the models for the pipeline
models = {
    'Logistic Regression': LogisticRegression(max_iter=200, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    # 'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(random_state=42),
    'Naive Bayes': GaussianNB(),
    'KNeighbors Classifier': KNeighborsClassifier()
}
# Define the hyperparameter grid for each model
param_grid = {
    'Logistic Regression': {
        'model__C': [0.1, 1, 10]
    },
    'Decision Tree': {
        'model__max_depth': [5, 10, 15],
        'model__min_samples_split': [2, 5, 10]
    },
    'Random Forest': {
        'model__n_estimators': [50, 100],
        'model__max_depth': [10, 15]
    },
    'KNeighbors Classifier': {
        'model__n_neighbors': [3, 5, 7]
    },
    'SVM': {
        'model__C': [0.1, 1, 10],
        'model__kernel': ['linear', 'rbf'],
        'model__gamma': [0.01, 0.1]
    },
    'Naive Bayes': {}
}
# Create a pipeline for each model
pipelines = {
    model_name: Pipeline([
        ('model', model)
    ]) for model_name, model in models.items()
}
# Run the loop for each pipeline model
for model_name, pipeline in pipelines.items():
    print(f"Processing {model_name}...")
    # Initial training on scaled data (X_train_scaled is the training split of X_scaled)
    pipeline.fit(X_train_scaled, y_train)
    y_pred_trainP = pipeline.predict(X_train_scaled)
    y_pred_testP = pipeline.predict(X_test_scaled)
    accuracy, precision, recall, f1, accuracy_test, precision_test, recall_test, f1_test, results_df = PrintOutput(results_df, model_name, X_train_scaled, X_test_scaled, y_train, y_test, y_pred_trainP, y_pred_testP)
    #*************************************Cross validation**********************************
    results_df = perform_cross_validation(pipeline, results_df, model_name, X_train, y_train, cv=5)
    #*************************************Print and store output**********************************
    print(f"Initial training completed for {model_name}")
    # Set up GridSearchCV for hyperparameter tuning on the PCA-transformed data
    if model_name in param_grid:  # Ensure we have hyperparameters defined for the model
        grid_search = GridSearchCV(pipeline, param_grid[model_name], cv=5, scoring='accuracy', n_jobs=-1)
        grid_search.fit(X_train_pca, y_train_pca)  # Fit the GridSearchCV object on the PCA-transformed data
        best_pipeline = grid_search.best_estimator_  # Use the tuned pipeline, already refit on the PCA training data
        y_pred_trainPGV = best_pipeline.predict(X_train_pca)
        y_pred_testPGV = best_pipeline.predict(X_test_pca)
        #*************************************Print and store output for Tuned model**********************************
        m_name = model_name + ' (Tuned on PCA data)'
        accuracy, precision, recall, f1, accuracy_test, precision_test, recall_test, f1_test, results_df = PrintOutput(results_df, m_name, X_train_pca, X_test_pca, y_train_pca, y_test_pca, y_pred_trainPGV, y_pred_testPGV)
        #*************************************Cross validation**********************************
        results_df = perform_cross_validation(best_pipeline, results_df, m_name, X_train_pca, y_train_pca, cv=5)
    else:
        print(f"No hyperparameter tuning for {model_name}")
Processing Logistic Regression...
*************** Logistic Regression Output Metrics ***************
Accuracy on training set : 0.94
Accuracy on test set : 0.89
Recall on training set: 0.94
Recall on test set: 0.89
Precision on training set: 0.95
Precision on test set: 0.9
F1 on train set: 0.94
F1 on test set: 0.89
Classification Report on test data:
precision recall f1-score support
-1 0.96 0.83 0.89 371
1 0.84 0.96 0.90 361
accuracy 0.89 732
macro avg 0.90 0.89 0.89 732
weighted avg 0.90 0.89 0.89 732
---------------------KFold Cross-validation for Logistic Regression----------------------
Average Kfold cross-validation score Logistic Regression: 0.7105802935272153
---------------------SKF Cross-validation for Logistic Regression----------------------
Average skf cross-validation score for Logistic Regression: 0.7105802935272153
Initial training completed for Logistic Regression
*************** Logistic Regression (Tuned on PCA data) Output Metrics ***************
Accuracy on training set : 0.81
Accuracy on test set : 0.77
Recall on training set: 0.81
Recall on test set: 0.77
Precision on training set: 0.81
Precision on test set: 0.77
F1 on train set: 0.81
F1 on test set: 0.77
Classification Report on test data:
precision recall f1-score support
-1 0.77 0.76 0.77 371
1 0.76 0.77 0.77 361
accuracy 0.77 732
macro avg 0.77 0.77 0.77 732
weighted avg 0.77 0.77 0.77 732
---------------------KFold Cross-validation for Logistic Regression (Tuned on PCA data)----------------------
Average Kfold cross-validation score Logistic Regression (Tuned on PCA data): 0.7898908894228269
---------------------SKF Cross-validation for Logistic Regression (Tuned on PCA data)----------------------
Average skf cross-validation score for Logistic Regression (Tuned on PCA data): 0.7898908894228269
Processing Decision Tree...
*************** Decision Tree Output Metrics ***************
Accuracy on training set : 1.0
Accuracy on test set : 0.87
Recall on training set: 1.0
Recall on test set: 0.87
Precision on training set: 1.0
Precision on test set: 0.87
F1 on train set: 1.0
F1 on test set: 0.87
Classification Report on test data:
precision recall f1-score support
-1 0.90 0.84 0.87 371
1 0.84 0.91 0.87 361
accuracy 0.87 732
macro avg 0.87 0.87 0.87 732
weighted avg 0.87 0.87 0.87 732
---------------------KFold Cross-validation for Decision Tree----------------------
Average Kfold cross-validation score Decision Tree: 0.8705526258308112
---------------------SKF Cross-validation for Decision Tree----------------------
Average skf cross-validation score for Decision Tree: 0.8705526258308112
Initial training completed for Decision Tree
*************** Decision Tree (Tuned on PCA data) Output Metrics ***************
Accuracy on training set : 1.0
Accuracy on test set : 0.86
Recall on training set: 1.0
Recall on test set: 0.86
Precision on training set: 1.0
Precision on test set: 0.87
F1 on train set: 1.0
F1 on test set: 0.86
Classification Report on test data:
precision recall f1-score support
-1 0.89 0.83 0.86 371
1 0.84 0.90 0.87 361
accuracy 0.86 732
macro avg 0.86 0.86 0.86 732
weighted avg 0.87 0.86 0.86 732
---------------------KFold Cross-validation for Decision Tree (Tuned on PCA data)----------------------
Average Kfold cross-validation score Decision Tree (Tuned on PCA data): 0.8509522472202287
---------------------SKF Cross-validation for Decision Tree (Tuned on PCA data)----------------------
Average skf cross-validation score for Decision Tree (Tuned on PCA data): 0.8509522472202287
Processing SVM...
*************** SVM Output Metrics ***************
Accuracy on training set : 1.0
Accuracy on test set : 1.0
Recall on training set: 1.0
Recall on test set: 1.0
Precision on training set: 1.0
Precision on test set: 1.0
F1 on train set: 1.0
F1 on test set: 1.0
Classification Report on test data:
precision recall f1-score support
-1 1.00 1.00 1.00 371
1 1.00 1.00 1.00 361
accuracy 1.00 732
macro avg 1.00 1.00 1.00 732
weighted avg 1.00 1.00 1.00 732
---------------------KFold Cross-validation for SVM----------------------
Average Kfold cross-validation score SVM: 0.6276375323743253
---------------------SKF Cross-validation for SVM----------------------
Average skf cross-validation score for SVM: 0.6276375323743253
Initial training completed for SVM
*************** SVM (Tuned on PCA data) Output Metrics ***************
Accuracy on training set : 1.0
Accuracy on test set : 0.99
Recall on training set: 1.0
Recall on test set: 0.99
Precision on training set: 1.0
Precision on test set: 0.99
F1 on train set: 1.0
F1 on test set: 0.99
Classification Report on test data:
precision recall f1-score support
-1 1.00 0.98 0.99 371
1 0.98 1.00 0.99 361
accuracy 0.99 732
macro avg 0.99 0.99 0.99 732
weighted avg 0.99 0.99 0.99 732
---------------------KFold Cross-validation for SVM (Tuned on PCA data)----------------------
Average Kfold cross-validation score SVM (Tuned on PCA data): 0.979491580075098
---------------------SKF Cross-validation for SVM (Tuned on PCA data)----------------------
Average skf cross-validation score for SVM (Tuned on PCA data): 0.979491580075098
Processing Naive Bayes...
*************** Naive Bayes Output Metrics ***************
Accuracy on training set : 0.86
Accuracy on test set : 0.87
Recall on training set: 0.86
Recall on test set: 0.87
Precision on training set: 0.86
Precision on test set: 0.87
F1 on train set: 0.86
F1 on test set: 0.87
Classification Report on test data:
precision recall f1-score support
-1 0.88 0.87 0.87 371
1 0.87 0.88 0.87 361
accuracy 0.87 732
macro avg 0.87 0.87 0.87 732
weighted avg 0.87 0.87 0.87 732
---------------------KFold Cross-validation for Naive Bayes----------------------
Average Kfold cross-validation score Naive Bayes: 0.8122101912815551
---------------------SKF Cross-validation for Naive Bayes----------------------
Average skf cross-validation score for Naive Bayes: 0.8122101912815551
Initial training completed for Naive Bayes
*************** Naive Bayes (Tuned on PCA data) Output Metrics ***************
Accuracy on training set : 0.88
Accuracy on test set : 0.88
Recall on training set: 0.88
Recall on test set: 0.88
Precision on training set: 0.89
Precision on test set: 0.88
F1 on train set: 0.88
F1 on test set: 0.88
Classification Report on test data:
precision recall f1-score support
-1 0.87 0.90 0.89 371
1 0.90 0.86 0.88 361
accuracy 0.88 732
macro avg 0.88 0.88 0.88 732
weighted avg 0.88 0.88 0.88 732
---------------------KFold Cross-validation for Naive Bayes (Tuned on PCA data)----------------------
Average Kfold cross-validation score Naive Bayes (Tuned on PCA data): 0.8764845383343214
---------------------SKF Cross-validation for Naive Bayes (Tuned on PCA data)----------------------
Average skf cross-validation score for Naive Bayes (Tuned on PCA data): 0.8764845383343214
Processing KNeighbors Classifier...
*************** KNeighbors Classifier Output Metrics ***************
Accuracy on training set : 0.61
Accuracy on test set : 0.55
Recall on training set: 0.61
Recall on test set: 0.55
Precision on training set: 0.78
Precision on test set: 0.76
F1 on train set: 0.53
F1 on test set: 0.44
Classification Report on test data:
precision recall f1-score support
-1 1.00 0.11 0.20 371
1 0.52 1.00 0.69 361
accuracy 0.55 732
macro avg 0.76 0.56 0.45 732
weighted avg 0.76 0.55 0.44 732
---------------------KFold Cross-validation for KNeighbors Classifier----------------------
Average Kfold cross-validation score KNeighbors Classifier: 0.7634453562996016
---------------------SKF Cross-validation for KNeighbors Classifier----------------------
Average skf cross-validation score for KNeighbors Classifier: 0.7634453562996016
Initial training completed for KNeighbors Classifier
*************** KNeighbors Classifier (Tuned on PCA data) Output Metrics ***************
Accuracy on training set : 0.9
Accuracy on test set : 0.85
Recall on training set: 0.9
Recall on test set: 0.85
Precision on training set: 0.91
Precision on test set: 0.89
F1 on train set: 0.89
F1 on test set: 0.85
Classification Report on test data:
precision recall f1-score support
-1 1.00 0.71 0.83 371
1 0.77 1.00 0.87 361
accuracy 0.85 732
macro avg 0.88 0.85 0.85 732
weighted avg 0.89 0.85 0.85 732
---------------------KFold Cross-validation for KNeighbors Classifier (Tuned on PCA data)----------------------
Average Kfold cross-validation score KNeighbors Classifier (Tuned on PCA data): 0.8368282002475531
---------------------SKF Cross-validation for KNeighbors Classifier (Tuned on PCA data)----------------------
Average skf cross-validation score for KNeighbors Classifier (Tuned on PCA data): 0.8368282002475531
Q.6. Post Training and Conclusion¶
Q.6.A. Display and compare all the models designed with their train and test accuracies¶
results_df
| | Model | train_acc | test_acc | train_recall | test_recall | train_precision | test_precision | Train_F1 | Test_F1 | KFold_score | SKF_score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Logistics Regression | 0.94 | 0.89 | 0.94 | 0.89 | 0.95 | 0.90 | 0.94 | 0.89 | 0.899268 | 0.899268 |
| 1 | Random Forest | 1.00 | 0.99 | 1.00 | 0.99 | 1.00 | 0.99 | 1.00 | 0.99 | 0.984041 | 0.983135 |
| 2 | Random Forest (Tuned) model | 1.00 | 0.99 | 1.00 | 0.99 | 1.00 | 0.99 | 1.00 | 0.99 | None | None |
| 3 | Random Forest model with PCA | 1.00 | 0.98 | 1.00 | 0.98 | 1.00 | 0.98 | 1.00 | 0.98 | 0.981313 | 0.979941 |
| 4 | Random Forest (Tuned) model PCA | 1.00 | 0.97 | 1.00 | 0.97 | 1.00 | 0.97 | 1.00 | 0.97 | None | None |
| 5 | Logistic Regression | 0.94 | 0.89 | 0.94 | 0.89 | 0.95 | 0.90 | 0.94 | 0.89 | 0.71058 | 0.71058 |
| 6 | Logistic Regression (Tuned on PCA data) | 0.81 | 0.77 | 0.81 | 0.77 | 0.81 | 0.77 | 0.81 | 0.77 | 0.789891 | 0.789891 |
| 7 | Decision Tree | 1.00 | 0.87 | 1.00 | 0.87 | 1.00 | 0.87 | 1.00 | 0.87 | 0.870553 | 0.870553 |
| 8 | Decision Tree (Tuned on PCA data) | 1.00 | 0.86 | 1.00 | 0.86 | 1.00 | 0.87 | 1.00 | 0.86 | 0.850952 | 0.850952 |
| 9 | SVM | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.627638 | 0.627638 |
| 10 | SVM (Tuned on PCA data) | 1.00 | 0.99 | 1.00 | 0.99 | 1.00 | 0.99 | 1.00 | 0.99 | 0.979492 | 0.979492 |
| 11 | Naive Bayes | 0.86 | 0.87 | 0.86 | 0.87 | 0.86 | 0.87 | 0.86 | 0.87 | 0.81221 | 0.81221 |
| 12 | Naive Bayes (Tuned on PCA data) | 0.88 | 0.88 | 0.88 | 0.88 | 0.89 | 0.88 | 0.88 | 0.88 | 0.876485 | 0.876485 |
| 13 | KNeighbors Classifier | 0.61 | 0.55 | 0.61 | 0.55 | 0.78 | 0.76 | 0.53 | 0.44 | 0.763445 | 0.763445 |
| 14 | KNeighbors Classifier (Tuned on PCA data) | 0.90 | 0.85 | 0.90 | 0.85 | 0.91 | 0.89 | 0.89 | 0.85 | 0.836828 | 0.836828 |
The table above lists the outputs and metrics for each of the models we have run so far.
- Model : Name of the model
- train_acc : Accuracy on training data
- test_acc : Accuracy on testing data
- train_recall : Recall on training data
- test_recall : Recall on testing data
- train_precision : Precision on training data
- test_precision : Precision on testing data
- Train_F1 : F1 score on training data
- Test_F1 : F1 score on testing data
- KFold_score : KFold Cross validation score
- SKF_score : SKF Cross validation score
In the above table, train_acc and test_acc represent training and testing accuracies, alongside the other output metrics, for the various models with and without PCA transformation. It can be inferred that almost all the models perform well in terms of accuracy, recall, and precision, except KNN on the original data.
Q.6.B. Select the final best trained model along with your detailed comments for selecting this model.¶
Best Model : SVM (Tuned on PCA data) [train_acc: 1.00, test_acc: 0.99, train_recall: 1.00, test_recall: 0.99, train_precision: 1.00, test_precision: 0.99, Train_F1: 1.00, Test_F1: 0.99, KFold_score: 0.98, SKF_score: 0.98] Data used : PCA (65 features out of 202), Balanced, Standardised
Looking at the results dataframe, the SVM and Random Forest models show comparably strong accuracy, recall, and precision for both classes, and the Random Forest trained on the full feature set is near perfect. However, SVM (Tuned on PCA data) stands out as the best choice: by leveraging the PCA-transformed data with far fewer features, it maintains performance without sacrificing accuracy. This conserves computational resources and ensures efficient processing, making it preferable in terms of both performance and resource utilization.
Considerations for choosing the best model:
- Persistence of performance at a lower computational cost, i.e. model performance on the reduced data.
- As defined in the goal statement, recall for class 1, balanced against precision and F1 score.
- Cross-validation score.
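These criteria can be applied mechanically to the results table. A pure-Python sketch over a few abbreviated rows (column names as in `results_df`, values copied from the table above; the scoring rule here is one plausible choice for illustration, not the notebook's own):

```python
# Abbreviated rows from the results table (KFold_score is None where
# cross-validation was not run for that row).
results = [
    {"Model": "Random Forest (Tuned) model", "test_recall": 0.99, "KFold_score": None},
    {"Model": "SVM", "test_recall": 1.00, "KFold_score": 0.627638},
    {"Model": "SVM (Tuned on PCA data)", "test_recall": 0.99, "KFold_score": 0.979492},
    {"Model": "Naive Bayes (Tuned on PCA data)", "test_recall": 0.88, "KFold_score": 0.876485},
]

def selection_key(row):
    """Score a model by the weaker of its test recall and KFold score,
    so it must both recall class 1 well and stay stable under CV."""
    cv = row["KFold_score"] if row["KFold_score"] is not None else 0.0
    return min(row["test_recall"], cv)

best = max(results, key=selection_key)
print(best["Model"])  # SVM (Tuned on PCA data)
```

Taking the minimum penalizes the untuned SVM, whose perfect test recall is undercut by its 0.63 cross-validation score, which matches the selection reasoning above.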
Q.6.C. Pickle the selected model for future use¶
# Save the model to disk using Pickle
with open('selected_model.pkl', 'wb') as f:
    pickle.dump(rf_classifier_pca, f)
print("Model saved successfully.")
Model saved successfully.
We have checked and confirmed that the .pkl file has been saved successfully.
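Loading the saved model later mirrors the save step. A self-contained round-trip sketch using a stand-in dictionary, since the fitted classifier is not reproduced here:

```python
import os
import pickle
import tempfile

# Stand-in for the fitted classifier saved above.
model = {"name": "rf_classifier_pca", "n_components": 65}

path = os.path.join(tempfile.gettempdir(), "selected_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)       # serialize to disk, as in the cell above

# Later (or in another process): load the object back and use it.
with open(path, "rb") as f:
    loaded = pickle.load(f)

assert loaded == model
print("Model round-tripped via", path)
```

For a real scikit-learn estimator, the loaded object exposes the same `predict`/`predict_proba` API, provided it is unpickled with a compatible scikit-learn version.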
Q.6.D. Write your conclusion on the results.¶
Based on the performance results of the various models, it is evident that certain models outperform others in terms of accuracy, recall, precision, and F1-score. Here is a concise conclusion:
SVM (Tuned on PCA data): This model exhibits exceptional performance across metrics, with perfect training scores (1.00) and 0.99 accuracy, recall, precision, and F1-score on the test set. It achieves a KFold score of approximately 0.98, indicating robust performance across cross-validation folds.
Random Forest model with PCA: Despite dimensionality reduction through PCA, this model maintains a high level of performance, with accuracy, recall, precision, and F1-score of 0.97 to 0.98 on the test set. The KFold score of around 0.98 further validates its stability across cross-validation folds.
Logistic Regression (Tuned on PCA data): This model shows a clear drop compared to the Random Forest and SVM models, with accuracy and F1-score around 0.77, and similarly modest recall and precision, suggesting room for improvement.
SVM (original data): The untuned SVM showcases perfect scores (1.00) on both train and test sets, but its KFold score of only about 0.63 indicates substantial variability across cross-validation folds and likely overfitting.
Naive Bayes (Tuned on PCA data): This model achieves decent scores of around 0.88 for accuracy, recall, precision, and F1-score, but falls short of the performance exhibited by the Random Forest and SVM models.
Decision Tree (Tuned on PCA data): The Decision Tree displays moderate performance, with accuracy and F1-score around 0.86 on the test set; its recall of 0.90 for class 1 comes with slightly lower precision, indicating potential for refinement.
KNeighbors Classifier (Tuned on PCA data): Tuning on PCA data improves this model markedly over its untuned counterpart (test accuracy 0.85 versus 0.55), though its scores remain lower than those of the best models, highlighting areas for improvement.
In summary, the SVM (Tuned on PCA data) model emerges as the top performer, followed closely by the Random Forest model with PCA. The Random Forest with PCA also performs near-perfectly on train and test, while the untuned SVM's cross-validation score falls well below average. These models demonstrate robust performance across metrics and exhibit promising potential for predictive analytics in the given context.
Factors that contributed to the best model's performance:
- Data processing : Removing unnecessary features
- Balancing data
- Standardizing data
- PCA dimensionality reduction
- Hyperparameter tuning